This workbook was used to create content for the blog post on investigation of European Social Study but is being updated every now and then as well. (HTLM version to show off active content via Rpubs and Azure. Since Rpubs seems to have some size limitations, the exploratory graphs on indicators and countries are not included. Current Azure subscription has not these limitations so there we have all content. But loading of the page may take some time..)

Change notes in the very endt * * *

Europe is going through interesting time since turn of the century: Brexit, refugee crisis, rise of populism, economic worries, rising tensions between the West and Russia - and a lot more Timeline of 21st century

European Social Study has been collecting data on people in most countries in Europe every two years since 2002. The purpose of this study is look into the recent history of Europe mainlywith the help of ESS data. As additional data we will use World Bank Development indicator set and a data set from UNHCR to provide some reference to ESS indicators.

We started by looking into subjective happiness, how it has changed over time and investigated what factors in ESS data are associated with it by using linear modeling.

From that we continued looking into other indicators which either were significantly associated with subjective happiness or otherwise were seen as interesting and important items for this study

Finally, we used Primary Component Analysis and k-means clustering to find best number of country clusters based on their similarity in the selected ESS variables.

Findings

Several historic events can be see in the data

Combined data set has several interesting features which may be source of some future blog posts, especially


European Social Survey

As described in the abstract of its eighth round (conducted 2016-2017)

The European Social Survey (ESS) is an academically-driven multi-country survey, which has been administered in over 30 countries to date. Its three aims are, firstly – to monitor and interpret changing public attitudes and values within Europe and to investigate how they interact with Europe’s changing institutions, secondly - to advance and consolidate improved methods of cross-national survey measurement in Europe and beyond, and thirdly - to develop a series of European social indicators, including attitudinal indicators.

In the eighth round, the survey covers 23 countries and employs the most rigorous methodologies. From Round 7 it is funded by the Members, Observers and Guests of ESS European Research Infrastructure Consortium (ESS ERIC) who represent national governments. Participating countries directly fund the central coordination costs of the ESS ERIC, as well the costs of fieldwork and national coordination in their own country.

The survey involves strict random probability sampling, a minimum target response rate of 70% and rigorous translation protocols. The hour-long face-to-face interview includes questions on a variety of core topics repeated from previous rounds of the survey and also two modules developed for Round 8 covering Public Attitudes to Climate Change, Energy Security, and Energy Preferences and Welfare Attitudes in a Changing Europe (the latter is a partial repeat of a module from Round 4).

In total there are over 500 indicators. However, all of them are not collected in all surveys, countries or in each interview.

Use of dataset is free for use as stated in ESS codebook

The data are available without restrictions, for not-for-profit purposes

Starting point for this investigation is the question on subjective happiness

Taking all things together, how happy would you say you are?

resulting in the indicator happy with range from zero (Extremely unhappy) to ten (Extremely happy).


Setup

# load key libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape2)
library(car)
library(Hmisc)
library(corrplot)
library(countrycode)
library(ggrepel)
library(GGally)
library(ggiraph)       # Interactive plot versions for ggplot2
library(viridis)       # red-green colour blind compatible palettes


# must set false for github_markup 
plot_interactive <- params$html_interactive

source("ess_subs.r")

my_data_folder = "C:\\Users\\Juha\\Documents\\MyData\\"

# reference p-value for statistics

p_val = 0.05

ESS Data

Main dataset used in this investigation covers the period 2002-2016. It was created merging a 2002-2014 dataset created with ESS Cumulative Data Wizard and a separately downloaded 2016 set since the latter was not in the wizard at the time.

Raw Data

Reading and merging ESS survey data from the two downloaded sets and keeping only those indicators which are in both.

dataset = GetData(data_folder = my_data_folder, use_rds = TRUE)
dim(dataset)
## [1] 376258    228
names(dataset)
##   [1] "name"       "essround"   "edition"    "idno"       "cntry"     
##   [6] "ppltrst"    "pplfair"    "pplhlp"     "polintr"    "trstprl"   
##  [11] "trstlgl"    "trstplc"    "trstplt"    "trstprt"    "trstep"    
##  [16] "trstun"     "vote"       "contplt"    "wrkprty"    "wrkorg"    
##  [21] "badge"      "sgnptit"    "pbldmn"     "bctprd"     "clsprty"   
##  [26] "prtdgcl"    "lrscale"    "stflife"    "stfeco"     "stfgov"    
##  [31] "stfdem"     "stfedu"     "stfhlth"    "gincdif"    "mnrgtjb"   
##  [36] "freehms"    "euftf"      "imsmetn"    "imdfetn"    "impcntr"   
##  [41] "imbgeco"    "imueclt"    "imwbcnt"    "happy"      "sclmeet"   
##  [46] "sclact"     "crmvct"     "aesfdrk"    "health"     "hlthhmp"   
##  [51] "rlgblg"     "rlgdnm"     "rlgblge"    "rlgdnme"    "rlgdgr"    
##  [56] "rlgatnd"    "pray"       "dscrgrp"    "dscrrce"    "dscrntn"   
##  [61] "dscrrlg"    "dscrlng"    "dscretn"    "dscrage"    "dscrgnd"   
##  [66] "dscrsex"    "dscrdsb"    "dscroth"    "dscrdk"     "dscrref"   
##  [71] "dscrnap"    "dscrna"     "ctzcntr"    "ctzshipc"   "brncntr"   
##  [76] "cntbrthc"   "livecnta"   "lnghom1"    "lnghom2"    "blgetmg"   
##  [81] "facntr"     "fbrncntb"   "mocntr"     "mbrncntb"   "gvrfgap"   
##  [86] "hhmmb"      "gndr"       "gndr2"      "gndr3"      "gndr4"     
##  [91] "gndr5"      "gndr6"      "gndr7"      "gndr8"      "gndr9"     
##  [96] "gndr10"     "gndr11"     "gndr12"     "agea"       "yrbrn"     
## [101] "yrbrn2"     "yrbrn3"     "yrbrn4"     "yrbrn5"     "yrbrn6"    
## [106] "yrbrn7"     "yrbrn8"     "yrbrn9"     "yrbrn10"    "yrbrn11"   
## [111] "yrbrn12"    "rshipa2"    "rshipa3"    "rshipa4"    "rshipa5"   
## [116] "rshipa6"    "rshipa7"    "rshipa8"    "rshipa9"    "rshipa10"  
## [121] "rshipa11"   "rshipa12"   "rshpsts"    "lvgptnea"   "dvrcdeva"  
## [126] "marsts"     "maritalb"   "chldhm"     "chldhhe"    "domicil"   
## [131] "edulvlb"    "eisced"     "eduyrs"     "pdwrk"      "edctn"     
## [136] "uempla"     "uempli"     "dsbld"      "rtrd"       "cmsrv"     
## [141] "hswrk"      "dngoth"     "dngref"     "dngdk"      "dngna"     
## [146] "mainact"    "mnactic"    "crpdwk"     "pdjobev"    "pdjobyr"   
## [151] "emplrel"    "emplno"     "wrkctra"    "estsz"      "jbspv"     
## [156] "njbspv"     "wkdcorga"   "iorgact"    "wkhct"      "wkhtot"    
## [161] "nacer2"     "tporgwk"    "isco08"     "wrkac6m"    "uemp3m"    
## [166] "uemp12m"    "uemp5yr"    "mbtru"      "hincsrca"   "hinctnta"  
## [171] "hincfel"    "edulvlpb"   "eiscedp"    "pdwrkp"     "edctnp"    
## [176] "uemplap"    "uemplip"    "dsbldp"     "rtrdp"      "cmsrvp"    
## [181] "hswrkp"     "dngothp"    "dngdkp"     "dngnapp"    "dngrefp"   
## [186] "dngnap"     "mnactp"     "crpdwkp"    "isco08p"    "emprelp"   
## [191] "wkhtotp"    "edulvlfb"   "eiscedf"    "emprf14"    "occf14b"   
## [196] "edulvlmb"   "eiscedm"    "emprm14"    "occm14b"    "atncrse"   
## [201] "ipcrtiv"    "imprich"    "ipeqopt"    "ipshabt"    "impsafe"   
## [206] "impdiff"    "ipfrule"    "ipudrst"    "ipmodst"    "ipgdtim"   
## [211] "impfree"    "iphlppl"    "ipsuces"    "ipstrgv"    "ipadvnt"   
## [216] "ipbhprp"    "iprspot"    "iplylfr"    "impenv"     "imptrad"   
## [221] "impfun"     "regunit"    "intewde"    "dweight"    "pspwght"   
## [226] "pweight"    "ess_year"   "cntry_name"

So, the dataset contains at this point over 376 000 rows and almost 230 columns. Variable details available in ESS codebook.

Survey coverage

We know from ESS portal already that all countries have not been in all surveys

# survey coverage
ds_participations <- dataset %>% 
  group_by(cntry_name, ess_year) %>%
  summarise(n=n()) %>%
  mutate(n_ave = mean(n, na.rm = TRUE),
         first_year = min(ess_year),
         last_year = max(ess_year),
         n_surveys = n()) %>% 
  spread(ess_year, n)

# On geographic map
p <- CreateEuroMap(dsin = ds_participations %>% ungroup(), 
                   ind_name = "n_surveys",
                   txt_title = "ESS has some gaps in geographic and time coverage",
                   txt_subtitle = "Number of surveys a country has participated in",
                   txt_caption = "ESS surveys 2002-2016")
MyPrintInteractive(p, plot_interactive)

So, there is a band of countries starting from Finland in the north eastern corner of Europe, stretching all the way to Spain and Portugal in the far south western corner, which have been in all or nearly all surveys.

p <- ggplot(data=ds_participations, 
            aes(x=reorder(cntry_name, -n_surveys))) +
  geom_hline(yintercept = 4.1, color="red", alpha=0.5) +
  geom_bar_interactive(stat="identity",
                       aes(y=n_surveys, 
                           alpha=ifelse(n_surveys <= 4, 0.6, 0.5),
                           fill=ifelse(n_surveys <= 4, "blue","red"),
                           tooltip=paste0(cntry_name, '\n', n_surveys, ' surveys'))) +
  labs(
    title="Some very interesting countries are only in some surveys",
    subtitle="E.g. Italy, Greece, Turkey in four or less surveys ",
    caption="ESS surveys 2002-2016") + 
  xlab("") + ylab("# surveys participated") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0),
        legend.position = "None",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        plot.caption = element_text(color="gray"),
        plot.subtitle = element_text(hjust = 0),
        axis.line = element_blank(), 
        axis.title.x = element_text(size=10, color="gray", hjust = 0),
        axis.text.x = element_text(hjust = 1, vjust=0.5, size=8)) + 
  coord_flip()

MyPrintInteractive(p, plot_interactive)
# creating a caption for table so that it can be referenced in the text later on
# ref https://datascienceplus.com/r-markdown-how-to-number-and-reference-tables/

tab.1_cap <- table_nums(name = "tab_1", 
             caption = "Summary of countries participating the survey, including count of responses each year")

Tab. 1: Summary of countries participating the survey, including count of responses each year

knitr::kable(ds_participations,
             digits = 0,
             col.names = gsub("_", " ", colnames(ds_participations)))
cntry name n ave first year last year n surveys 2002 2004 2006 2008 2010 2012 2014 2016
Austria 2145 2002 2016 5 2257 2256 2405 NA NA NA 1795 2010
Belgium 1793 2002 2016 8 1899 1778 1798 1760 1704 1869 1769 1766
Bulgaria 2081 2006 2012 4 NA NA 1400 2230 2434 2260 NA NA
Croatia 1566 2008 2010 2 NA NA NA 1484 1649 NA NA NA
Cyprus 1102 2006 2012 4 NA NA 995 1215 1083 1116 NA NA
Czechia 2174 2002 2016 7 1360 3026 NA 2018 2386 2009 2148 2269
Denmark 1548 2002 2014 7 1506 1487 1505 1610 1576 1650 1502 NA
Estonia 1916 2004 2016 7 NA 1989 1517 1661 1793 2380 2051 2019
Finland 2025 2002 2016 8 2000 2022 1896 2195 1878 2197 2087 1925
France 1881 2002 2016 8 1503 1806 1986 2073 1728 1968 1917 2070
Germany 2918 2002 2016 8 2919 2870 2916 2751 3031 2958 3045 2852
Greece 2440 2002 2010 4 2566 2406 NA 2072 2715 NA NA NA
Hungary 1642 2002 2016 8 1685 1498 1518 1544 1561 2014 1698 1614
Iceland 737 2004 2016 3 NA 579 NA NA NA 752 NA 880
Ireland 2281 2002 2016 8 2046 2286 1800 1764 2576 2628 2390 2757
Israel 2485 2002 2016 6 2499 NA NA 2490 2294 2508 2562 2557
Italy 1580 2002 2016 4 1207 1529 NA NA NA 960 NA 2626
Lithuania 2040 2010 2016 4 NA NA NA NA 1677 2109 2250 2122
Luxembourg 1594 2002 2004 2 1552 1635 NA NA NA NA NA NA
Netherlands 1898 2002 2016 8 2364 1881 1889 1778 1829 1845 1919 1681
Norway 1656 2002 2016 8 2036 1760 1750 1549 1548 1624 1436 1545
Poland 1766 2002 2016 8 2110 1716 1721 1619 1751 1898 1615 1694
Portugal 1874 2002 2016 8 1511 2052 2222 2367 2150 2151 1265 1270
Russia 2492 2006 2016 5 NA NA 2437 2512 2595 2484 NA 2430
Slovakia 1758 2004 2012 5 NA 1512 1766 1810 1856 1847 NA NA
Slovenia 1364 2002 2016 8 1519 1442 1476 1286 1403 1257 1224 1307
Spain 1938 2002 2016 8 1729 1663 1876 2576 1885 1889 1925 1958
Sweden 1799 2002 2016 8 1999 1948 1927 1830 1497 1847 1791 1551
Switzerland 1732 2002 2016 8 2040 2141 1804 1819 1506 1493 1532 1525
Turkey 2136 2004 2008 2 NA 1856 NA 2416 NA NA NA NA
Ukraine 1997 2004 2012 5 NA 2031 2002 1845 1931 2178 NA NA
United Kingdom 2203 2002 2016 8 2052 1897 2394 2352 2422 2286 2264 1959

Weights

Data has to be weighted for analysis as pointed out in ESS material

Note In general, you must weight tables before quoting percentages from them. To apply weights, select the Weight icon and choose one or more weight variables to use. The Design weights (DWEIGHT) adjust for different selection probabilities, while the Post-stratification weights (PSPWGHT) adjust for sampling error and non-response bias as well as different selection probabilities. Either DWEIGHT or PSPWGHT must always be used. In addition, the Population size weights (PWEIGHT) should be applied if you are looking at aggregates or averages for two or more countries combined. Please see the guide Weighting European Social Survey Data for fuller details about how to use the weights.

We will use pspwght post stratification weighting which attempts to take into account varying response rates in different population groups.

The weights do vary quite a bit, from almost zero all the way up to close to seven, meaning that a response can essentially vote on behalf of almost seven persons and another practically none, as illustrated in the following graph.

p <- ggplot(data = dataset, aes(pspwght)) +
  geom_histogram(bins = 200, fill = 'gray') +
  geom_vline(
    xintercept = quantile(dataset$pspwght),
    linetype = "dashed",
    color = "black") +
  annotate(
    "text",
    label = names(quantile(dataset$pspwght)),
    x = quantile(dataset$pspwght),
    y = 19000,           # fixed by experimentation
    size = 2) +
  labs(
    title = "Individual answer typically weighted by one - but not always",
    subtitle = "Distribution of 'pspwght' across all surveys, highlighting quantiles",
    caption = "ESS surveys 2002-2016") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0),
        panel.border = element_blank(), 
        plot.subtitle = element_text(hjust = 0)) +
  xlab("") + ylab("")
print(p)


Subjective Happiness

First indicator we are interested in is subjective happiness, happy and will consider other indicators later on

Questions

Pre-processing

Checking data

summary(dataset$happy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.000   8.000   7.754   9.000  99.000

Some pre-processing needed since valid range is 0-10.

In ESS data - as described in their codebook - missing responses, refusal to answer, etc are encoded using numbers outside of the normal range of an indicator. And these vary depending on indicator.

So, for happy we will filter out responses with higher than 10 score since they are not relevant for this study

  • 77 - Refusal
  • 88 - Don’t know
  • 99 - No answer

We replace the higher than 10 values for happy with NA

dataset$happy[dataset$happy > 10] <- NA

# creating new dataset for by country and by survey average of happy 
# by excluding missing data and calculating weighted means
ds_happy_ave <- dataset %>% 
  filter(!is.na(happy)) %>%
  group_by(ess_year, cntry_name) %>% 
  summarise(var_ave = Hmisc::wtd.mean(happy, pspwght),
            n=sum(pspwght))

Making a diagnostic plot to verify that the weighted scores land into the expected range of happy.

# get skewness of the distribution 
happy_skew <- moments::skewness(ds_happy_ave$var_ave)

ggplot(data=ds_happy_ave, aes(var_ave)) +
  geom_histogram(bins = 15, fill='gray') +
  geom_vline(xintercept = quantile(ds_happy_ave$var_ave), linetype="dashed", color="gray") +
  annotate("text", label=names(quantile(ds_happy_ave$var_ave)), 
           x=quantile(ds_happy_ave$var_ave), y=28, size=3) +
  labs(
    title="Subjective happiness is highly skewed",
    subtitle="Weighted by survey/country scores",
    caption="ESS surveys 2002-2016") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.title = element_text(size=6, color="gray"),
    plot.title = element_text(hjust = 0),
    plot.subtitle = element_text(hjust = 0)) +
  xlab("") + ylab("")

summary(ds_happy_ave$var_ave)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.405   6.820   7.403   7.277   7.828   8.455

And that indeed is the case, producing a clearly left skewed distribution (-0.648).

Some countries have been in all surveys and some in only two. So there are no big conclusions to be drawn at this stage.

How is happiness distributed in Europe

Next we investigate the average scores by country on geographic map to see if there are any patterns.

ds_happy_avgsum <- MakeAverageSummary(ds = ds_happy_ave, value_term = "var_ave")

p <- CreateEuroMap(dsin = ds_happy_avgsum %>%
                     filter(cntry_name != "Mean"), 
                   ind_name = "var_ave",
                   txt_title = "North and west appear subjectively happier\nthan east and south east",
                   txt_subtitle = "Average score of 'happy', scale 0-10",
                   txt_caption = "ESS surveys 2002-2016")
MyPrintInteractive(p, plot_interactive)

Europe seems to be divided into two areas and a border zone running diagonally from South West to North East. With Portugal and Israel exceptions to that pattern.

This divide might be a result of different cultural traditions and norms refleced in how people score personal happiness. Therefore, comparing absolute scores between the countries might actually be misleading.

An alternative would be to compare each country by itself, i.e. has the personal assesment of happiness improved or not compared to past in that same country.

Changes of happiness by country and survey

Next we will show average of happy by country for each survey as a heatmap ordered by decreasing average happiness (calculated over all surveys a country has participated in).

p_happy <- MakeIndicatorHeatSet(
   ds = ds_happy_avgsum, 
   value_term = "var_ave", 
   txt_name = "Lower subjective happiness in former \nSoviet block and south east",
   txt_subtitle = "Average score of 'happy', scale 0-10",
   txt_caption = "ESS surveys 2002-2016")

MyPrintInteractive(p_happy, plot_interactive) 

Note that mean across columns and rows is a simple average over the scores shown on the table.

Since each ESS survey often has had different number of countries the averages over columns cannot be used for any meaningful trend evaluation.

Especially as the lowest means seem to correspond to surveys where Turkey, Ukraine and Bulgaria were participating. And these countries are clearly in the overall lower end of subjective happiness spectrum in ESS.

Variations

Heatmap and line graph above indicate some variation of scores between surveys. That is more visible in the plot below where we include all countries

p <- ggplot(data = ds_happy_avgsum %>% filter(ess_year=="Mean", cntry_name != "Mean"),
            aes(x=reorder(cntry_name, var_ave))) +
  geom_point_interactive(data=ds_happy_ave, 
                         aes(y=var_ave, color=as.factor(ess_year),
                             tooltip = paste0(cntry_name, " ", ess_year, 
                                              "\nScore: ", round(var_ave,2)))) +
  geom_point_interactive(aes(y=var_ave,
                             tooltip = paste0(cntry_name, 
                                              "\nMean: ", round(var_ave,2))), 
                         color="black", pch=5) +
  labs(title="Variations get smaller with higher happiness score",
       subtitle="All countries in order of decreasing mean (black diamond)",
       caption = "ESS surveys 2002-2016") +
  scale_color_viridis(discrete=TRUE) +
  theme_bw() +
  theme(
    panel.border = element_blank(),
      axis.title=element_blank(),
    plot.title = element_text(hjust = 0),
    plot.subtitle = element_text(hjust = 0)) +
  labs(color="ESS Year") +
  coord_flip()

MyPrintInteractive(p, plot_interactive) 

For some countries there is quite wide variation in the scores between surveys. But in general it looks like those on the high end of happiness have had less variation than those on the lower end.

ds_happy_sd <- ds_happy_ave %>% 
  group_by(cntry_name) %>% 
  dplyr::summarize(sd = sd(var_ave), n=n()) %>% 
  mutate(cntry_name2 = paste0(cntry_name, " (",n,")")) %>%
  arrange(desc(sd)) %>%
  filter(n >= 4)

p <- ggplot(data=ds_happy_sd, aes(x=reorder(cntry_name2, sd), y=sd, fill=n)) +
  geom_bar_interactive(aes(tooltip=paste0(cntry_name, "\nSurveys: ", n, "\nSD: ", round(sd,2))),
                       stat = "identity") + 
  coord_flip() +
  labs(title="Top three in variation: Italy, Greece and Poland",
       subtitle="Countries in four or more surveys included",
       caption="ESS surveys 2002-2016") +
  scale_color_viridis(discrete=TRUE) +
  ylab("Standard deviation of 'happy") + xlab("") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    plot.title = element_text(hjust = 0),
    plot.subtitle = element_text(hjust = 0)) 

MyPrintInteractive(p, plot_interactive) 

Countries in three or less surveys excluded.

Significance and meaning of changes

There is a visual suggestion that subjective happiness may have improved for several countries over the past couple of surveys.

So, we will check the significance of the change from 2002 to 2016 as well as from
2014 to 2016 using weighted t-test for each country separately.

Note the significance levels coded in symbols

symbol level
*** 0.001
** 0.01
* 0.05
+ 0.1
1

And we will check if the change was meaningful - having a meaningful effect - using Cohen’s d metric.

happy_res <- GetComparisonBoxPlot(indval_name =  "happy", 
                                  ds_avg = ds_happy_avgsum %>% select(-n), 
                                  ds = dataset %>% filter(!is.na(happy)),
                                  cntry_list = cntry_in_100pct,
                                  txt_name = "Subjective happiness", 
                                  txt_subtitle= "", 
                                  txt_caption="ESS 2002-2016",
                                  skip_plot = FALSE)

tab.2_cap <- table_nums(name="tab_2",
                        caption = "Changes of subjective happiness 2002 vs 2016 and 2014 vs 2016 with p-values for significance of the change, Cohen's d values for effect size")

Tab. 2: Changes of subjective happiness 2002 vs 2016 and 2014 vs 2016 with p-values for significance of the change, Cohen’s d values for effect size

knitr::kable(happy_res$table, digits = 3,
             col.names = gsub("_", " ", colnames(happy_res$table)))
cntry name 2002 2014 2016 d0216 eff cil 0216 eff ciu 0216 eff m 0216 p 0216 sig 0216 d1416 eff cil 1416 eff ciu 1416 eff m 1416 p 1416 sig 1416
Ireland 7.863 7.331 7.550 -0.312 0.071 0.075 negligible 0.000 *** 0.219 0.071 0.075 negligible 0.000 ***
Sweden 7.874 7.896 7.850 -0.024 0.079 0.083 negligible 0.644 -0.047 0.079 0.083 negligible 0.366
France 7.410 7.351 7.396 -0.014 0.010 0.014 negligible 0.850 0.045 0.010 0.014 negligible 0.432
Belgium 7.695 7.743 7.727 0.033 -0.014 -0.010 negligible 0.552 -0.016 -0.014 -0.010 negligible 0.834
United Kingdom 7.602 7.583 7.649 0.047 0.365 0.369 small 0.388 0.065 0.365 0.369 small 0.236
Netherlands 7.847 7.865 7.937 0.090 -0.021 -0.017 negligible 0.026 * 0.072 -0.021 -0.017 negligible 0.080 +
Finland 8.031 8.038 8.122 0.091 -0.043 -0.039 negligible 0.058 + 0.084 -0.043 -0.039 negligible 0.076 +
Switzerland 8.043 8.088 8.168 0.125 -0.039 -0.035 negligible 0.016 * 0.080 -0.039 -0.035 negligible 0.136
Norway 7.902 7.957 8.087 0.184 -0.055 -0.051 negligible 0.000 *** 0.129 -0.055 -0.051 negligible 0.018 *
Spain 7.457 7.437 7.747 0.290 -0.079 -0.075 negligible 0.000 *** 0.310 -0.079 -0.075 negligible 0.000 ***
Portugal 6.953 6.973 7.437 0.484 -0.075 -0.070 negligible 0.000 *** 0.465 -0.075 -0.070 negligible 0.000 ***
Germany 7.189 7.586 7.757 0.568 -0.077 -0.074 negligible 0.000 *** 0.171 -0.077 -0.074 negligible 0.000 ***
Slovenia 6.906 7.116 7.477 0.571 -0.230 -0.225 small 0.000 *** 0.361 -0.230 -0.225 small 0.000 ***
Hungary 6.325 6.384 6.901 0.577 -0.067 -0.062 negligible 0.000 *** 0.518 -0.067 -0.062 negligible 0.000 ***
Poland 6.422 7.270 7.475 1.053 -0.436 -0.432 small 0.000 *** 0.204 -0.436 -0.432 small 0.002 **
MyPrintInteractive(happy_res$plot, plot_interactive) 

The box-plots are in line with t-test results shown in table Tab. 2. But only a couple of the changes was meaningful according to COhen’s d measure, and even the the effect was only small.

But perhaps that is acceptable in a survey like this. The situation would be quite different if we were looking for an impact coming from a controlled experiment.

Next we check a few individual cases to get better understanding of data.

cowplot::plot_grid(
  PlotIndicatorHistogram(dataset, "Finland", 2002, "happy"),
  PlotIndicatorHistogram(dataset, "Finland", 2012, "happy"), 
  PlotIndicatorHistogram(dataset, "Finland", 2016, "happy"), 
  PlotIndicatorHistogram(dataset, "Switzerland", 2016, "happy"), 
  ncol = 2)

cowplot::plot_grid(
  PlotIndicatorHistogram(dataset, "Poland", 2012, "happy"),
  PlotIndicatorHistogram(dataset, "France", 2002, "happy"),
  PlotIndicatorHistogram(dataset, "France", 2012, "happy"), 
  PlotIndicatorHistogram(dataset, "France", 2014, "happy"), 
  ncol = 2)

Skew can be strong, as can be seen in the histograms above..

However, the sample counts used in t-tests are quite large, as shown in Tab. 1, so the results should be fairly accurate.

Wrapup and what next

So, there are changes that are statistically significant even if their effect is small or even neglible.

We have a few hundred additional indicators in ESS dataset. Do we have anything there which has association or correlation with subjective happiness?


Happiness and associations

Subjective happiness varies and has trends - sort of at least.

The next question is: are there significant associations with subjective happiness and othedr ESS variabkes

We pick a few additional indicators which - quite subjectively - seem like they could be associated with feeling of happiness. In a way or another.

Additional indicators mainly for context taken from

These might be used in a more formal as part of analysis way later.

Scoping to the countries which have been in 50% or more of the surveys (four or more out of the eight surveys).

And just as a reminder, it is about associations and correlations only. Not possible to go into honest discussion about causality based on the data available.

ds_subset_50 <- dataset %>% 
  filter(cntry_name %in% cntry_in_50pct)

ESS indicators

Granted, whole ESS dataset contains many more and the selection made here and now has no hard scientific basis. However, they seem intuitively such which might be related to subjective happiness.

The indicators chosen

# get indicators and their descriptions from a file which is easier to maintain than
# writing the same into R-code

metadata_columns <- c("cntry_name","ess_year", "idno", "pspwght")
indicator_table <- readxl::read_excel("codemap.xlsx")
ess_indicators <- union(metadata_columns, 
                        as.character((indicator_table %>% filter(source=="ESS"))$indicator))
knitr::kable(indicator_table, caption = "ESS indicators along with their ranges")
ESS indicators along with their ranges
indicator name min max source lblmin lblmax
happy Taking all things together, how happy would you say you are 0 10 ESS Extremely unhappy Extremely happy
ppltrst Most people can be trusted or you can’t be too careful 0 10 ESS You can’t be too careful Most people can be trusted
pplfair Most people try to take advantage of you, or try to be fair 0 10 ESS Most people try to take advantage of me Most people try to be fair
pplhlp Most of the time people helpful or mostly looking out for themselves 0 10 ESS People mostly look out for themselves People mostly try to be helpful
polintr How interested in politics 1 4 ESS Very interested Not at all interested
trstprl Trust in country’s parliament 0 10 ESS No trust at all Complete trust
trstlgl Trust in the legal system 0 10 ESS No trust at all Complete trust
trstplc Trust in the police 0 10 ESS No trust at all Complete trust
trstplt Trust in politicians 0 10 ESS No trust at all Complete trust
trstprt Trust in political parties 0 10 ESS No trust at all Complete trust
trstep Trust in the European Parliament 0 10 ESS No trust at all Complete trust
trstun Trust in the United Nations 0 10 ESS No trust at all Complete trust
stflife How satisfied with life as a whole 0 10 ESS Extremely dissatisfied Extremely satisfied
stfeco How satisfied with present state of economy in country 0 10 ESS Extremely dissatisfied Extremely satisfied
stfgov How satisfied with the national government 0 10 ESS Extremely dissatisfied Extremely satisfied
stfdem How satisfied with the way democracy works in country 0 10 ESS Extremely dissatisfied Extremely satisfied
stfedu State of education in country nowadays 0 10 ESS Extremely dissatisfied Extremely satisfied
stfhlth State of health services in country nowadays 0 10 ESS Extremely dissatisfied Extremely satisfied
eisced Highest level of education, ES ISCED 1 7 ESS ES-ISCED I , less than lower secondary ES-ISCED V2, higher tertiary education, >= MA level
health Subjective general health 1 5 ESS Very good Very bad
euftf European Union: European unification go further or gone too far 0 10 ESS Unification already gone too far Unification go further
impcntr Allow many/few immigrants from poorer countries outside Europe 1 4 ESS Allow many to come and live here Allow none
imbgeco Immigration bad or good for country’s economy 0 10 ESS Bad for the economy Good for the economy
imueclt Country’s cultural life undermined or enriched by immigrants 0 10 ESS Cultural life undermined Cultural life enriched
imwbcnt Immigrants make country worse or better place to live 0 10 ESS Worse place to live Better place to live

The following might be candidates at later stage, but we will start with those above

  • brncntr Born in country All rounds: Were you born in [country]?
  • gndr - Gender All rounds: CODE SEX, respondent
  • yrbrn - Year of birth All rounds: And in what year were you born?
  • agea - Age of respondent, calculated All rounds: Age of respondent, calculated
ds_subset_50 <- ds_subset_50 %>%
  select(which(colnames(dataset) %in% ess_indicators))
print(dim(ds_subset_50))
## [1] 363455     29

Pre-processing

As with happy earlier, there is the need to make some adjustments to get a better usable set of data for analysis. Using min/max rules from the definition file.

# marking excluded data as NA

print("Before")
## [1] "Before"
summary(ds_subset_50[, indicator_table$indicator])
##      happy           ppltrst          pplfair           pplhlp      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 6.000   1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.: 3.000  
##  Median : 8.000   Median : 5.000   Median : 6.000   Median : 5.000  
##  Mean   : 7.198   Mean   : 5.329   Mean   : 6.395   Mean   : 5.347  
##  3rd Qu.: 9.000   3rd Qu.: 7.000   3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :10.000   Max.   :99.000   Max.   :99.000   Max.   :99.000  
##  NA's   :2437                                                       
##     polintr         trstprl         trstlgl          trstplc      
##  Min.   :1.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.: 3.00   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median :3.000   Median : 5.00   Median : 5.000   Median : 6.000  
##  Mean   :2.637   Mean   : 6.68   Mean   : 7.343   Mean   : 6.977  
##  3rd Qu.:3.000   3rd Qu.: 6.00   3rd Qu.: 7.000   3rd Qu.: 8.000  
##  Max.   :9.000   Max.   :99.00   Max.   :99.000   Max.   :99.000  
##                                                                   
##     trstplt          trstprt          trstep          trstun     
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 2.000   1st Qu.: 2.00   1st Qu.: 3.00   1st Qu.: 4.00  
##  Median : 4.000   Median : 4.00   Median : 5.00   Median : 6.00  
##  Mean   : 5.274   Mean   : 5.73   Mean   :14.39   Mean   :14.12  
##  3rd Qu.: 5.000   3rd Qu.: 5.00   3rd Qu.: 7.00   3rd Qu.: 8.00  
##  Max.   :99.000   Max.   :99.00   Max.   :99.00   Max.   :99.00  
##                   NA's   :40807                                  
##     stflife           stfeco           stfgov           stfdem      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 5.000   1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.: 4.000  
##  Median : 7.000   Median : 5.000   Median : 4.000   Median : 5.000  
##  Mean   : 7.278   Mean   : 6.434   Mean   : 6.958   Mean   : 8.676  
##  3rd Qu.: 8.000   3rd Qu.: 7.000   3rd Qu.: 6.000   3rd Qu.: 7.000  
##  Max.   :99.000   Max.   :99.000   Max.   :99.000   Max.   :99.000  
##                                    NA's   :2046                     
##      stfedu        stfhlth          eisced           health     
##  Min.   : 0.0   Min.   : 0.00   Min.   : 0.000   Min.   :1.000  
##  1st Qu.: 4.0   1st Qu.: 3.00   1st Qu.: 1.000   1st Qu.:2.000  
##  Median : 6.0   Median : 5.00   Median : 3.000   Median :2.000  
##  Mean   : 9.8   Mean   : 6.37   Mean   : 3.391   Mean   :2.245  
##  3rd Qu.: 8.0   3rd Qu.: 7.00   3rd Qu.: 5.000   3rd Qu.:3.000  
##  Max.   :99.0   Max.   :99.00   Max.   :99.000   Max.   :9.000  
##                                                                 
##      euftf          impcntr         imbgeco          imueclt      
##  Min.   : 0.00   Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.00   1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median : 5.00   Median :3.000   Median : 5.000   Median : 6.000  
##  Mean   :13.61   Mean   :2.814   Mean   : 8.987   Mean   : 9.386  
##  3rd Qu.: 8.00   3rd Qu.:3.000   3rd Qu.: 7.000   3rd Qu.: 8.000  
##  Max.   :99.00   Max.   :9.000   Max.   :99.000   Max.   :99.000  
##  NA's   :91616                                                    
##     imwbcnt      
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 9.054  
##  3rd Qu.: 7.000  
##  Max.   :99.000  
## 
for (i in 1:nrow(indicator_table)){
  an_ind <- indicator_table[i,]
  #print(an_ind$indicator)
  # print(head(ds_subset_50[,indicator_table$indicator[i]]))
  ds_subset_50[,an_ind$indicator] <- ifelse(
    ds_subset_50[,an_ind$indicator] > an_ind$max
    | ds_subset_50[,an_ind$indicator] < an_ind$min, 
    NA, 
    ds_subset_50[,an_ind$indicator])
}

print("After")
## [1] "After"
summary(ds_subset_50[, indicator_table$indicator])
##      happy           ppltrst          pplfair           pplhlp      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 6.000   1st Qu.: 3.000   1st Qu.: 4.000   1st Qu.: 3.000  
##  Median : 8.000   Median : 5.000   Median : 6.000   Median : 5.000  
##  Mean   : 7.198   Mean   : 5.012   Mean   : 5.587   Mean   : 4.875  
##  3rd Qu.: 9.000   3rd Qu.: 7.000   3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  NA's   :2437     NA's   :1378     NA's   :3548     NA's   :2038    
##     polintr         trstprl          trstlgl          trstplc      
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.: 2.000   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median :3.000   Median : 5.000   Median : 5.000   Median : 6.000  
##  Mean   :2.619   Mean   : 4.349   Mean   : 5.064   Mean   : 5.884  
##  3rd Qu.:3.000   3rd Qu.: 6.000   3rd Qu.: 7.000   3rd Qu.: 8.000  
##  Max.   :4.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##  NA's   :1173    NA's   :10146    NA's   :9987     NA's   :4831    
##     trstplt          trstprt          trstep          trstun     
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 2.000   1st Qu.: 2.00   1st Qu.: 3.00   1st Qu.: 3.00  
##  Median : 3.000   Median : 3.00   Median : 5.00   Median : 5.00  
##  Mean   : 3.495   Mean   : 3.45   Mean   : 4.41   Mean   : 5.12  
##  3rd Qu.: 5.000   3rd Qu.: 5.00   3rd Qu.: 6.00   3rd Qu.: 7.00  
##  Max.   :10.000   Max.   :10.00   Max.   :10.00   Max.   :10.00  
##  NA's   :7641     NA's   :49501   NA's   :43400   NA's   :39475  
##     stflife           stfeco           stfgov           stfdem     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 5.000   1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.: 3.00  
##  Median : 7.000   Median : 5.000   Median : 4.000   Median : 5.00  
##  Mean   : 6.851   Mean   : 4.432   Mean   : 4.171   Mean   : 5.17  
##  3rd Qu.: 8.000   3rd Qu.: 6.000   3rd Qu.: 6.000   3rd Qu.: 7.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##  NA's   :1924     NA's   :8710     NA's   :14107    NA's   :15411  
##      stfedu          stfhlth           eisced          health     
##  Min.   : 0.000   Min.   : 0.000   Min.   :1.0     Min.   :1.000  
##  1st Qu.: 4.000   1st Qu.: 3.000   1st Qu.:2.0     1st Qu.:2.000  
##  Median : 6.000   Median : 5.000   Median :4.0     Median :2.000  
##  Mean   : 5.531   Mean   : 5.243   Mean   :3.8     Mean   :2.236  
##  3rd Qu.: 7.000   3rd Qu.: 7.000   3rd Qu.:5.0     3rd Qu.:3.000  
##  Max.   :10.000   Max.   :10.000   Max.   :7.0     Max.   :5.000  
##  NA's   :18733    NA's   :4856     NA's   :72220   NA's   :527    
##      euftf           impcntr         imbgeco          imueclt      
##  Min.   : 0.00    Min.   :1.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 3.00    1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median : 5.00    Median :3.000   Median : 5.000   Median : 5.000  
##  Mean   : 5.12    Mean   :2.592   Mean   : 4.844   Mean   : 5.458  
##  3rd Qu.: 7.00    3rd Qu.:3.000   3rd Qu.: 7.000   3rd Qu.: 7.000  
##  Max.   :10.00    Max.   :4.000   Max.   :10.000   Max.   :10.000  
##  NA's   :119493   NA's   :15057   NA's   :18111    NA's   :17292   
##     imwbcnt      
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 4.812  
##  3rd Qu.: 6.000  
##  Max.   :10.000  
##  NA's   :18536

Shape of distributions

Showing the shape of all indicators, excluding metadata columns to help understand the data better.

GetMinMaxLabel <- function(ind_name, indicator_table){
  this_ind <- indicator_table %>% filter(indicator==ind_name)
  if (!is.na(this_ind$lblmin)){
    minmax_label <- paste(this_ind$lblmin, this_ind$min, 
                         '<-->', 
                         this_ind$lblmax, this_ind$max)
  }else {
    minmax_label <- ""
  }
  #print(minmax_label)
  return(minmax_label)
}

ind_plots <- lapply(
  setdiff(colnames(ds_subset_50), metadata_columns), 
  function(x, ds){
    ds$tmp_filter <- ds[,x]
    PlotIndicatorHistogram(ds %>% 
                             filter(!is.na(tmp_filter)), 
                           "", 
                           NA, 
                           x, 
                           n_bins = as.integer(max(ds$tmp_filter, na.rm = TRUE)))
  }, 
  ds=ds_subset_50)

# no need to make these plots interactive - print as they are
for (i in 1:length(ind_plots))
  print(ind_plots[[i]])

Mostly the distributions at this level do not seem significantly skewed.

Concern: This could be an effect from averaging over all data and could hide potential time-dependent or country-dependent skew impacting later analysis. However, at this stage we will assume that there are no issues that would require making elaborate transformations like what we experimented with in the first analysis of happy earlier. Perhaps later.

Weights

Next we apply post stratification weights for the selected indicators to get meaningful averages.

ds_subset_50_ave <- ds_subset_50 %>%
  gather(indicator, value, indicator_table$indicator) %>%
  filter(!is.na(value)) %>%
  group_by(cntry_name, ess_year, indicator) %>% 
  summarise(ave = Hmisc::wtd.mean(value, pspwght)) %>%
  spread(indicator, ave) %>%
  ungroup()

Missing values

Checking the indicators.

summary(ds_subset_50_ave %>% select(-cntry_name, -ess_year))
##      eisced          euftf           happy           health     
##  Min.   :2.272   Min.   :3.818   Min.   :5.405   Min.   :1.685  
##  1st Qu.:3.449   1st Qu.:4.716   1st Qu.:6.825   1st Qu.:1.972  
##  Median :3.645   Median :5.129   Median :7.396   Median :2.186  
##  Mean   :3.681   Mean   :5.190   Mean   :7.272   Mean   :2.193  
##  3rd Qu.:3.926   3rd Qu.:5.611   3rd Qu.:7.819   3rd Qu.:2.370  
##  Max.   :5.034   Max.   :7.332   Max.   :8.366   Max.   :2.915  
##  NA's   :36      NA's   :47                                     
##     imbgeco         impcntr         imueclt         imwbcnt     
##  Min.   :3.017   Min.   :1.763   Min.   :3.150   Min.   :2.814  
##  1st Qu.:4.443   1st Qu.:2.374   1st Qu.:5.023   1st Qu.:4.395  
##  Median :4.893   Median :2.550   Median :5.571   Median :4.788  
##  Mean   :4.842   Mean   :2.576   Mean   :5.462   Mean   :4.836  
##  3rd Qu.:5.328   3rd Qu.:2.759   3rd Qu.:6.007   3rd Qu.:5.395  
##  Max.   :6.170   Max.   :3.558   Max.   :7.322   Max.   :6.617  
##                                                                 
##     polintr         pplfair          pplhlp         ppltrst     
##  Min.   :2.158   Min.   :3.666   Min.   :2.998   Min.   :3.329  
##  1st Qu.:2.451   1st Qu.:4.941   1st Qu.:4.228   1st Qu.:4.289  
##  Median :2.673   Median :5.563   Median :4.760   Median :4.968  
##  Mean   :2.644   Mean   :5.582   Mean   :4.859   Mean   :5.015  
##  3rd Qu.:2.790   3rd Qu.:6.157   3rd Qu.:5.660   3rd Qu.:5.666  
##  Max.   :3.148   Max.   :7.285   Max.   :6.984   Max.   :6.916  
##                                                                 
##      stfdem          stfeco          stfedu          stfgov     
##  Min.   :2.434   Min.   :1.368   Min.   :3.126   Min.   :1.553  
##  1st Qu.:4.449   1st Qu.:3.476   1st Qu.:4.917   1st Qu.:3.525  
##  Median :5.075   Median :4.452   Median :5.520   Median :4.251  
##  Mean   :5.196   Mean   :4.484   Mean   :5.581   Mean   :4.212  
##  3rd Qu.:6.059   3rd Qu.:5.367   3rd Qu.:6.267   3rd Qu.:4.945  
##  Max.   :7.423   Max.   :7.937   Max.   :8.029   Max.   :6.580  
##                                                  NA's   :1      
##     stfhlth         stflife          trstep         trstlgl     
##  Min.   :2.482   Min.   :4.223   Min.   :2.587   Min.   :1.898  
##  1st Qu.:4.157   1st Qu.:6.386   1st Qu.:4.039   1st Qu.:4.189  
##  Median :5.397   Median :7.042   Median :4.579   Median :5.076  
##  Mean   :5.286   Mean   :6.916   Mean   :4.477   Mean   :5.062  
##  3rd Qu.:6.302   3rd Qu.:7.664   3rd Qu.:4.904   3rd Qu.:5.992  
##  Max.   :7.667   Max.   :8.552   Max.   :5.963   Max.   :7.595  
##                                                                 
##     trstplc         trstplt         trstprl         trstprt     
##  Min.   :2.035   Min.   :1.390   Min.   :1.671   Min.   :1.404  
##  1st Qu.:5.104   1st Qu.:2.770   1st Qu.:3.594   1st Qu.:2.712  
##  Median :6.062   Median :3.471   Median :4.367   Median :3.383  
##  Mean   :5.885   Mean   :3.517   Mean   :4.381   Mean   :3.490  
##  3rd Qu.:6.647   3rd Qu.:4.288   3rd Qu.:5.272   3rd Qu.:4.299  
##  Max.   :8.199   Max.   :5.532   Max.   :6.690   Max.   :5.597  
##                                                  NA's   :21     
##      trstun     
##  Min.   :2.596  
##  1st Qu.:4.711  
##  Median :5.100  
##  Mean   :5.159  
##  3rd Qu.:5.549  
##  Max.   :6.859  
## 
NA_columns <- ds_subset_50_ave %>% 
  select_if(function(x) any(is.na(x))) %>% 
  summarise_all(funs(sum(is.na(.)))) 
knitr::kable(NA_columns, caption = "Columns with NA entries")
Columns with NA entries
eisced euftf stfgov trstprt
36 47 1 21
txt_NA <- paste(colnames(NA_columns), collapse=", ")

Indicators eisced, euftf, stfgov, trstprt have some missing values. We will create new indicators where we fill in missing values by interpolation and/or use the first / last real value for missing beginning / end values.

Naturally, these are not the real thing anymore but should be close enough for this analysis.

# special case for Greece which has only one entry for eisced (education level)

Greece_eisced2 <- ds_subset_50_ave %>% 
  filter(cntry_name == "Greece", !is.na(eisced)) %>% 
  select(eisced)
if (nrow(Greece_eisced2) != 1) stop("Greece eisced assumption not valid")

# using zoo package to impute missing values

ds_subset_50_ave <- ds_subset_50_ave %>% 
  group_by(cntry_name) %>% 
  mutate(
    eisced2 = ifelse(cntry_name == "Greece",
                     as.numeric(Greece_eisced2$eisced[1]),
                     zoo::na.fill(eisced, "extend")),
    trstprt2 = zoo::na.fill(trstprt, "extend"),
    stfgov2 = zoo::na.fill(stfgov, "extend"),
    euftf2 = zoo::na.fill(euftf, "extend")) %>% 
  ungroup()

# Diagnostic to show what values were before/after imputation
# but as the table is fairly long, commented out now
# 
# cntry_imputed <- (ds_subset_50_ave %>% 
#                     filter(!complete.cases(.)) %>% 
#                     select(cntry_name) %>% 
#                     filter(!duplicated(cntry_name)))$cntry_name
# knitr::kable(ds_subset_50_ave %>% 
#                filter(cntry_name %in% cntry_imputed) %>%
#                select(cntry_name, ess_year, eisced, eisced2, trstprt, trstprt2, 
#                       stfgov, stfgov2, euftf, euftf2),
#              digits = 3,
#              caption = "Table of original and imputed values")

# removing the originals from the dataframe as they are not needed in the 
# subsequent work

ds_subset_50_ave <- ds_subset_50_ave %>%
  select(-eisced, -trstprt, -stfgov, -euftf)

Get rid of collinear variables

Since we intend to come up with a model we must remove indicators that are essentially the same, i.e. indicators which have too high correlation and would really not add to the model and would be in conflict with the requirements of linear modeling we aim to do.

coll_cutoff <- 0.9
print(paste("Cutoff level for correlation", coll_cutoff))
## [1] "Cutoff level for correlation 0.9"
# exclude metadata indicators

df1 = ds_subset_50_ave %>% select(-ess_year, -cntry_name)
df2 = cor(df1)

# findCorrelation searches through a correlation matrix and returns a vector of 
# integers corresponding to columns to remove to reduce pair-wise correlations.

hc = caret::findCorrelation(
  df2, 
  cutoff=coll_cutoff, 
  verbose = TRUE,     # verbose helps to understand better what happens
  names = FALSE)
## Compare row 11  and column  23 with corr  0.905 
##   Means:  0.673 vs 0.535 so flagging column 11 
## Compare row 23  and column  19 with corr  0.987 
##   Means:  0.663 vs 0.523 so flagging column 23 
## Compare row 8  and column  10 with corr  0.933 
##   Means:  0.656 vs 0.51 so flagging column 8 
## Compare row 19  and column  20 with corr  0.951 
##   Means:  0.635 vs 0.497 so flagging column 19 
## Compare row 20  and column  17 with corr  0.914 
##   Means:  0.617 vs 0.484 so flagging column 20 
## Compare row 10  and column  9 with corr  0.909 
##   Means:  0.617 vs 0.47 so flagging column 10 
## Compare row 15  and column  1 with corr  0.967 
##   Means:  0.601 vs 0.453 so flagging column 15 
## All correlations <= 0.9
hc = sort(hc)

# summary
keep_table <- data.frame(
  rownr = 1:dim(df1)[2], 
  indicator = colnames(df1),
  keep = ifelse((1:dim(df1)[2]) %in% hc, 
                "drop", 
                "keep"),
  name = sapply(colnames(df1), function(x) GetIndicator(x, indicator_table)$name)
)

knitr::kable(keep_table, caption = "Features")
Features
rownr indicator keep name
happy 1 happy keep Taking all things together, how happy would you say you are
health 2 health keep Subjective general health
imbgeco 3 imbgeco keep Immigration bad or good for country’s economy
impcntr 4 impcntr keep Allow many/few immigrants from poorer countries outside Europe
imueclt 5 imueclt keep Country’s cultural life undermined or enriched by immigrants
imwbcnt 6 imwbcnt keep Immigrants make country worse or better place to live
polintr 7 polintr keep How interested in politics
pplfair 8 pplfair drop Most people try to take advantage of you, or try to be fair
pplhlp 9 pplhlp keep Most of the time people helpful or mostly looking out for themselves
ppltrst 10 ppltrst drop Most people can be trusted or you can’t be too careful
stfdem 11 stfdem drop How satisfied with the way democracy works in country
stfeco 12 stfeco keep How satisfied with present state of economy in country
stfedu 13 stfedu keep State of education in country nowadays
stfhlth 14 stfhlth keep State of health services in country nowadays
stflife 15 stflife drop How satisfied with life as a whole
trstep 16 trstep keep Trust in the European Parliament
trstlgl 17 trstlgl keep Trust in the legal system
trstplc 18 trstplc keep Trust in the police
trstplt 19 trstplt drop Trust in politicians
trstprl 20 trstprl drop Trust in country’s parliament
trstun 21 trstun keep Trust in the United Nations
eisced2 22 eisced2 keep Highest level of education, ES ISCED
trstprt2 23 trstprt2 drop Trust in political parties
stfgov2 24 stfgov2 keep How satisfied with the national government
euftf2 25 euftf2 keep European Union: European unification go further or gone too far
# the reduced data
reduced_Data = df1[,-c(hc)]

colnames_keep <- union(colnames(reduced_Data),c("ess_year","cntry_name"))
ds_subset_50_ave <- ds_subset_50_ave %>% select(colnames_keep)

Reshape

Turn dataset into narrow format for flexibility

ds_ss_avg  <- ds_subset_50_ave %>% 
  gather(indicator, value, which(colnames(ds_subset_50_ave) %in% colnames(reduced_Data)))
print(dim(ds_ss_avg))
## [1] 3366    4

World Bank

World Bank Development Indices provide a good backdrop and context for ESS.

Selecting a few which appear like they could be relevant in this context, which suggests indicators related to migration, refugees and economical situation.

Again, this is fairly subjective selection and something important may be missing.

Since the countries differ in size - and therefore the indicators as well - we use percentages and ratios when possible.

wb_data <- GetPreProcessWbData(
  read_rds = TRUE, 
  inc_countries = unique(ds_ss_avg$cntry_name), 
  min_year = min(dataset$ess_year), 
  max_year = max(dataset$ess_year))
## Joining, by = "indicator"
# merge 

ds_ss_avg <- ds_ss_avg %>%
  bind_rows(wb_data$data %>% 
              select(ess_year, cntry_name, indicator, value))

# add definitions
indicator_table <- indicator_table %>%
  bind_rows(wb_data$indicators)

knitr::kable(indicator_table %>% filter(source=="WB"), digits = 3)
indicator name min max source lblmin lblmax
BX.TRF.PWKR.DT.GD.ZS Personal remittances, received (% of GDP) 7.000000e-02 8.154000e+00 WB NA NA
GC.TAX.TOTL.GD.ZS Tax revenue (% of GDP) 1.230000e+00 4.834400e+01 WB NA NA
IT.NET.USER.ZS Individuals using the Internet (% of population) 1.874000e+00 9.729800e+01 WB NA NA
NY.GDP.MKTP.KD GDP (constant 2010 US$) 1.592704e+10 3.781699e+12 WB NA NA
NY.GDP.MKTP.KD.ZG GDP growth (annual %) -1.481400e+01 2.555700e+01 WB NA NA
NY.GDP.PCAP.KD GDP per capita (constant 2010 US$) 2.130659e+03 9.161728e+04 WB NA NA
NY.GDP.PCAP.KD.ZG GDP per capita growth (annual %) -1.456000e+01 2.476500e+01 WB NA NA
REFG.PCT Refugees (% of population) - calculated 0.000000e+00 2.319000e+00 WB NA NA
SE.XPD.TOTL.GD.ZS Government expenditure on education, total (% of GDP) 2.326000e+00 8.560000e+00 WB NA NA
SH.XPD.GHED.GD.ZS Domestic general government health expenditure (% of GDP) 0.000000e+00 9.420000e+00 WB NA NA
SL.TLF.CACT.ZS Labor force participation rate, total (% of total population ages 15+) (modeled ILO estimate) 4.785300e+01 6.856500e+01 WB NA NA
SM.POP.REFG Refugee population by country or territory of asylum 5.000000e+00 9.800000e+05 WB NA NA
SM.POP.TOTL.ZS International migrant stock (% of population) 7.950000e-01 2.938700e+01 WB NA NA
SP.POP.TOTL Population, total 9.769660e+05 1.453060e+08 WB NA NA
SP.RUR.TOTL.ZS Rural population (% of total population) 2.103000e+00 5.037300e+01 WB NA NA
SP.URB.TOTL.IN.ZS Urban population (% of total) 4.962700e+01 9.789700e+01 WB NA NA

For clarity

  • International migrant stock above means people who have been born in another country
  • Refugees are different group of people than e.g. asylum seekers, which are not in World Bank dataset at all

For asylum seekers we need to turn to UNHCR

UNHCR

Given that refugees and asylum seekers have been an important factor in the 2000s Europe we need to get more data, which is available at UNHCR at UNHCR Population Statistics - time series

Since World Bank and UNHCR data are not necessarily in full synch the numbers on refugees, which is available from both sources, is expected to be not exactly the same.

Therefore we take refugees from World Bank and other “people of concern” groups from UNHCR

But first, we check what is in that UNHCR people of concern classification and how it has changed over 2000s.

# take data from beginning of ess surveys

ds_unhcr <- GetUnhcrData(
  data_folder = my_data_folder, 
  inc_countries = unique(ds_ss_avg$cntry_name), 
  min_year = min(dataset$ess_year), 
  max_year = max(dataset$ess_year), 
  read_rds = TRUE)

p <- ggplot(data= ds_unhcr, 
            aes(x=ess_year, y=value, fill=pop_type, color=pop_type)) +
  geom_bar_interactive(stat="identity", 
                       aes(tooltip=paste(ess_year, pop_type))) +
  labs(title="Count of UNHCR 'people of concern' in Europe\nincreasing rapidly",
       subtitle="Countries participating in ESS",
       caption="UNHCR") +
  theme_bw() +
  theme(
    panel.border = element_blank()
  ) +
  xlab("") + ylab("")
MyPrintInteractive(p, plot_interactive)

So, a downwarsd trend turned into a sharp rise 2014 as - some might say - a thing hit the fan..

Some insigt - perhaps - from two snapshots, i.e. 2013 and 2016.

p <- ggplot(data= ds_unhcr %>% filter(ess_year==2013), 
       aes(x=reorder(cntry_name, -value), y=value, fill=pop_type)) +
  geom_bar_interactive(stat="identity", 
                       aes(tooltip=paste(cntry_name,'-', pop_type))) +
  labs(title="People of concern in Europe 2013",
       subtitle="Countries participating in ESS",
       caption="UNHCR") +
  xlab("") + ylab("") + 
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
  coord_flip()
MyPrintInteractive(p, plot_interactive)
p <- ggplot(data= ds_unhcr %>% filter(ess_year==2016), 
       aes(x=reorder(cntry_name, -value), y=value, fill=pop_type)) +
  geom_bar_interactive(stat="identity", 
                       aes(tooltip=paste(cntry_name, '-', pop_type))) +
  labs(title="People of concern in Europe 2016",
       subtitle="Countries participating in ESS",
       caption="UNHCR") +
  xlab("") + ylab("") +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
  coord_flip()
MyPrintInteractive(p, plot_interactive)

Huge increase in “internally displaced persons” in Ukraine - yes, there was and still is that crisis regarding Crimea

Trying log10 scaler for y-axiz to get better view.

p <- ggplot(data= ds_unhcr %>% filter(ess_year==2016), 
       aes(x=reorder(cntry_name, -value), y=(value+1), fill=pop_type)) +
  geom_bar_interactive(stat="identity", 
                       aes(tooltip=paste(cntry_name, '-', pop_type))) +
  labs(title="People of concern in Europe 2016",
       subtitle="Countries participating in ESS - log10 scale",
       caption="UNHCR") +
  xlab("") + ylab("") + 
  scale_y_log10() +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
  coord_flip()
MyPrintInteractive(p, plot_interactive)
# top 4 and all others

top_pop_types <- as.character((ds_unhcr %>% 
  group_by(pop_type) %>% 
  summarise(n = sum(value)) %>% 
  arrange(-n))$pop_type)

pop_type_major <- top_pop_types[1:4]
pop_type_other <- setdiff(top_pop_types, pop_type_major)

GetPplOfConcernSubset <- function(ds_unhr, wb_data, pop_set, new_indicator, new_name) {
  
  the_indicator <- paste0(
      "UNHCR.",
      ifelse(new_indicator=="", substr(pop_set[1], 1, 7), new_indicator),
      ".PCT")
  the_name = paste0(
      ifelse(new_name == "", pop_set[1], new_name), 
             "(% of population) - calculated")
  
  # print(pop_set)
  # print(the_indicator)
  # print(the_name)
  
  ds <- GetPreProcessUnhcrData(
    ds_unhcr = ds_unhcr %>%
      filter(pop_type %in% pop_set) %>%
      group_by(indicator, name, source, ess_year, cntry_name) %>%
      summarise(value = sum(value)) %>%
      ungroup(),
    ds_wb = wb_data$data,
    new_indicator = the_indicator,
    new_name = the_name)
  return(ds)  
}

unhcr_data <- NULL
unhcr_indicator_table <- NULL

for (pop_type in pop_type_major){
  ds_pop <- GetPplOfConcernSubset(ds_unhr, wb_data, pop_type, new_indicator="", new_name="")
  unhcr_data <- unhcr_data %>% bind_rows(ds_pop$data)
  unhcr_indicator_table <- unhcr_indicator_table %>% bind_rows(ds_pop$indicators)
}

ds_pop <- GetPplOfConcernSubset(ds_unhr, wb_data, pop_type_other, 
                                new_indicator="Other", 
                                new_name="Other population types together")
unhcr_data <- unhcr_data %>%
  bind_rows(ds_pop$data)
unhcr_indicator_table <- unhcr_indicator_table %>%
  bind_rows(ds_pop$indicators)


p <- ggplot(data=unhcr_data %>% filter(ess_year==2016), 
       aes(x=reorder(cntry_name, -value), y=value+1, fill=indicator)) +
  geom_bar_interactive(stat="identity", 
                       aes(tooltip=paste(cntry_name, '-', indicator))) +
  labs(title="People of concern in Europe 2016",
       subtitle="% of population, log10 scale yaxis showing 1% more than real",
       caption="UNHCR") +
  xlab("") + ylab("") + 
  scale_y_log10() +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
  coord_flip()
MyPrintInteractive(p, plot_interactive)
# merge data
ds_ss_avg <- ds_ss_avg %>%
  bind_rows(unhcr_data %>% 
              select(ess_year, cntry_name, indicator, value))

# merge definitions
indicator_table <- indicator_table %>%
  bind_rows(unhcr_indicator_table)

knitr::kable(indicator_table %>% filter(source=="UNHCR"), digits = 3)
indicator name min max source lblmin lblmax
UNHCR.Refugee.PCT Refugees (incl. refugee-like situations)(% of population) - calculated 0 2.319 UNHCR NA NA
UNHCR.Asylum-.PCT Asylum-seekers(% of population) - calculated 0 1.603 UNHCR NA NA
UNHCR.Interna.PCT Internally displaced persons(% of population) - calculated 0 4.000 UNHCR NA NA
UNHCR.Statele.PCT Stateless(% of population) - calculated 0 11.048 UNHCR NA NA
UNHCR.Other.PCT Other population types together(% of population) - calculated 0 6.276 UNHCR NA NA

Cross check with World Bank data

# check WB vs UNHCR data

p <- ggplot(data = ds_ss_avg %>% 
         filter(indicator %in% c("UNHCR.Refugee.PCT","REFG.PCT"),
                ess_year==2016),
       aes(x=reorder(cntry_name, -value), y=value, fill=indicator)) +
  geom_bar_interactive(stat = "identity", position = "dodge", aes(tooltip=paste(cntry_name, indicator, round(value,2)))) +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
  xlab("") + ylab("") + 
  coord_flip()

MyPrintInteractive(p, plot_interactive)

Looks good and reassuring.

Pct indicators

There are big variations in the sizes of countries included. Therefore looking into percentages is often more informative than absolute figures.

The following are indicators on percentages in World Bankand UNHCR data - or derived from them.

idx_pct <- grepl("%", indicator_table$name, fixed =TRUE)
pct_indicators <- indicator_table$indicator[idx_pct]
knitr::kable(indicator_table[idx_pct,], digits = 3)
indicator name min max source lblmin lblmax
BX.TRF.PWKR.DT.GD.ZS Personal remittances, received (% of GDP) 0.070 8.154 WB NA NA
GC.TAX.TOTL.GD.ZS Tax revenue (% of GDP) 1.230 48.344 WB NA NA
IT.NET.USER.ZS Individuals using the Internet (% of population) 1.874 97.298 WB NA NA
NY.GDP.MKTP.KD.ZG GDP growth (annual %) -14.814 25.557 WB NA NA
NY.GDP.PCAP.KD.ZG GDP per capita growth (annual %) -14.560 24.765 WB NA NA
REFG.PCT Refugees (% of population) - calculated 0.000 2.319 WB NA NA
SE.XPD.TOTL.GD.ZS Government expenditure on education, total (% of GDP) 2.326 8.560 WB NA NA
SH.XPD.GHED.GD.ZS Domestic general government health expenditure (% of GDP) 0.000 9.420 WB NA NA
SL.TLF.CACT.ZS Labor force participation rate, total (% of total population ages 15+) (modeled ILO estimate) 47.853 68.565 WB NA NA
SM.POP.TOTL.ZS International migrant stock (% of population) 0.795 29.387 WB NA NA
SP.RUR.TOTL.ZS Rural population (% of total population) 2.103 50.373 WB NA NA
SP.URB.TOTL.IN.ZS Urban population (% of total) 49.627 97.897 WB NA NA
UNHCR.Refugee.PCT Refugees (incl. refugee-like situations)(% of population) - calculated 0.000 2.319 UNHCR NA NA
UNHCR.Asylum-.PCT Asylum-seekers(% of population) - calculated 0.000 1.603 UNHCR NA NA
UNHCR.Interna.PCT Internally displaced persons(% of population) - calculated 0.000 4.000 UNHCR NA NA
UNHCR.Statele.PCT Stateless(% of population) - calculated 0.000 11.048 UNHCR NA NA
UNHCR.Other.PCT Other population types together(% of population) - calculated 0.000 6.276 UNHCR NA NA
# save a copy to be used by other scripts

saveRDS(indicator_table, file="indicator_table.rds")
saveRDS(ds_ss_avg, file="ds_ss_avg.rds")

ds_ss_avg_cntry_ind <- ds_ss_avg %>% 
  spread(indicator, value)

ds_ss_avg_ind_cntry <- ds_ss_avg %>% 
  spread(cntry_name, value)

Explore Extended Data

Investiating and learning the extended ESS data set spiced up with World Bank and UNHCR data.

Note that an interactive version of the graphs below is available at the time of writing as a shine application on web.

By indicator

Checking first what we can learn from by indicator view, i.e. how different are the trajectories of countries. Are there any curious outliers?

# need to construct a temporary rmd file which does not put javascript output inside 
# a loop construct - the temporary file gets read by the next chunk of the document

GetIndicatorPlotJs <- function(indicator_short, id, prefix) {
  # preparing an additional file to allow renderin of javascript, id and prefix not necessary
  c(
    " ",
    paste0("#### ", indicator_short),
    " ",
    "```{r, echo=FALSE, warning=FALSE}", 
    paste0("this_ind <- GetIndicator('", 
           indicator_short,
           "', indicator_table)"),
    paste0("p",prefix,id,"  <- MakeIndicatorCountryPlot("),
    "ds = ds_ss_avg_cntry_ind,",
    paste0("id_term = '",indicator_short,"',"),
    "cntry_term = 'cntry_name',",
    "year_term = 'ess_year',",
    "txt_head = as.character(this_ind$name),",
    "txt_subhead = '",indicator_short,"',",
    "txt_caption = paste(as.character(this_ind$source), '2002-2016'),",
    "show_summary = TRUE,",
    "plevel = p_val)",
    "if (!is.na(this_ind$lblmin)){",
    paste0("p",prefix,id,"$ci <- p",prefix,id,"$ci +"),
    "ylab(paste(this_ind$lblmin, this_ind$min, '<-->', this_ind$lblmax, this_ind$max)) +",
    "theme(axis.title.y = element_text(size = 7))",
    "}",
    paste0("MyPrintInteractive(p",prefix,id,"$ci, plot_interactive, hover_css = 'stroke-width:3px;') "),
    " ",
    "```", 
    " "
  )
}

if (params$explore_indicator_graphs){
  ascript <- NULL
  for (indicator_short in unique(ds_ss_avg$indicator)){
    ascript <- c(ascript, GetIndicatorPlotJs(indicator_short, length(ascript), "ind"))
  }
} else {
  ascript <- c("", "Exploratory graphs skipped to keep the document shorter.", " ")
}

tmp_indicators_rmd <- "tmp_indicators.rmd"
if (file.exists(tmp_indicators_rmd)) file.remove(tmp_indicators_rmd)
## [1] TRUE
fileConn<-file(tmp_indicators_rmd)
writeLines(c(
  " ",
  ascript, 
  " "
  ), fileConn)
close(fileConn)

happy

health

imbgeco

impcntr

imueclt

imwbcnt

polintr

pplhlp

stfeco

stfedu

stfhlth

trstep

trstlgl

trstplc

trstun

eisced2

stfgov2

euftf2

SH.XPD.GHED.GD.ZS

NY.GDP.MKTP.KD

NY.GDP.MKTP.KD.ZG

NY.GDP.PCAP.KD

NY.GDP.PCAP.KD.ZG

SE.XPD.TOTL.GD.ZS

IT.NET.USER.ZS

SM.POP.TOTL.ZS

SL.TLF.CACT.ZS

BX.TRF.PWKR.DT.GD.ZS

SP.POP.TOTL

SM.POP.REFG

SP.RUR.TOTL.ZS

GC.TAX.TOTL.GD.ZS

SP.URB.TOTL.IN.ZS

REFG.PCT

UNHCR.Refugee.PCT

UNHCR.Asylum-.PCT

UNHCR.Interna.PCT

UNHCR.Statele.PCT

UNHCR.Other.PCT

By country

GetCountryPlotJs <- function(cntry) {
  # preparing an additional file to allow rendering of javascript
  c(
    " ",
    paste0("#### ", cntry),
    "```{r, echo=FALSE, warning=FALSE}", 
    " ",
    "p_ess  <- MakeIndicatorCountryPlot(",
    "  ds = ds_ss_avg_ind_cntry %>% ",
    "    filter(indicator %in% (indicator_table %>% ",
    "                             filter(source=='ESS'))$indicator) %>% ",
    "left_join(indicator_table %>% ",
    paste0("            mutate(long_name = gsub('",   # need to get rid of quotation mark!!
           "\\'",
           "','',name, fixed=TRUE)) %>% "),
    "            select(indicator, long_name),",
    "          by='indicator'),",
    paste0("  id_term = '",cntry,"',"),
    "  cntry_term = 'indicator',",
    "  cntry_term_long = 'long_name',",
    "  year_term = 'ess_year',",
    paste0("  txt_head = '",cntry,"',"),
    "  txt_subhead = 'ESS survey indicators',",
    "  txt_caption = 'ESS 2002-2016',",
    "  show_summary = FALSE,",
    "  label_size_coef = lbl_coef,",
    "  plevel = p_val)",
    " ",
    "p_wb_a  <- MakeIndicatorCountryPlot(",
    "  ds = ds_ss_avg_ind_cntry %>% ",
    "    filter(indicator %in% (indicator_table[idx_pct,] %>% ",
    "                             filter(max > pct_max_split))$indicator) %>%",
    "left_join(indicator_table %>% ",
    "            mutate(long_name = name) %>% ",
    "            select(indicator, long_name),",
    "          by='indicator'),",
    paste0("  id_term = '",cntry,"',"),
    "  cntry_term = 'indicator',",
    "  cntry_term_long = 'long_name',",
    "  year_term = 'ess_year',",
    paste0("  txt_head = '",cntry,"',"),
    "  txt_subhead = 'World Bank Development Indices & UNHCR',",
    "  txt_caption = 'World Bank & UNHCR 2002-16',",
    "  show_summary = FALSE,",
    "  label_size_coef = lbl_coef,",
    "  plevel = p_val)",
    " ",
    "p_wb_b  <- MakeIndicatorCountryPlot(",
    "  ds = ds_ss_avg_ind_cntry %>% ",
    "    filter(indicator %in% (indicator_table[idx_pct,] %>% ",
    "                             filter(max <= pct_max_split))$indicator) %>%",
    "left_join(indicator_table %>% ",
    "            mutate(long_name = name) %>% ",
    "            select(indicator, long_name),",
    "          by='indicator'),",
    paste0("  id_term = '",cntry,"',"),
    "  cntry_term = 'indicator',",
    "  cntry_term_long = 'long_name',",
    "  year_term = 'ess_year',",
    paste0("  txt_head = '",cntry,"',"),
    "  txt_subhead = 'World Bank Development Indices & UNHCR',",
    "  txt_caption = 'World Bank & UNHCR 2002-16',",
    "  show_summary = FALSE,",
    "  label_size_coef = lbl_coef,",
    "  plevel = p_val)",
    " ",
    "MyPrintInteractive(p_ess$ci, plot_interactive, hover_css = 'stroke-width:3px;') ",
    "MyPrintInteractive(p_wb_a$ci, plot_interactive, hover_css = 'stroke-width:3px;') ",
    "MyPrintInteractive(p_wb_b$ci, plot_interactive, hover_css = 'stroke-width:3px;') ",
    "```", 
    " "
  )
}

lbl_coef <- 1
pct_max_split <- 30

if (params$explore_country_graphs){
  ascript <- NULL
  for (cntry in unique(ds_ss_avg$cntry_name)){
    ascript <- c(ascript, GetCountryPlotJs(cntry))
  } 
}else {
  ascript <- c("", "Exploratory graphs skipped to keep the document shorter.", " ")
}

tmp_countries_rmd <- "tmp_countries.rmd"
if (file.exists(tmp_countries_rmd)) file.remove(tmp_countries_rmd)
## [1] TRUE
fileConn<-file(tmp_countries_rmd)
writeLines(c(
  "",
  ascript, 
  ""
  ), fileConn)
close(fileConn)

Austria

Belgium

Bulgaria

Cyprus

Czechia

Denmark

Estonia

Finland

France

Germany

Greece

Hungary

Ireland

Israel

Italy

Lithuania

Netherlands

Norway

Poland

Portugal

Russia

Slovakia

Slovenia

Spain

Sweden

Switzerland

Ukraine

United Kingdom

Linear Modeling

We create a multiple linear regression model to get insight into associations between dependent and independent variables in our data.

As data we use the average scores of the indicators by country and by ESS survey. We check the performance of the model with the results of the 2016 survey.

In creating the model we will not use the year of a survey as an element. In other words, we assume that the relationship is independent of time.

At least for the time being.

Assumptions

Using multivariate linear model which assumes http://www.statisticssolutions.com/assumptions-of-multiple-linear-regression/

  • linearity - linear relationship between dependent and independent variables
  • multivariate normality - residuals are normally distributed
  • no multicollinearity - independent variables are not highly correlated with each other
  • homoskedasticity - variance of error terms similar across the values of independent variables

The following pairwise correlation graph gives a view into some of the above. And the diagnostic graphs produced in the model give more.

Multicollinearity was handled already earlier when we chose only one variable out of a set which was highly correlated with each other.

p_pairs <- ggpairs(
  ds_subset_50_ave %>% 
    select(-cntry_name, -ess_year),
  lower = list(
    continuous = wrap("points", alpha = 0.3,    size=0.1)
    ),
  upper = list(
    continuous = wrap("cor", size = 1.5, alignPercent = 1)
  )) +
  theme(strip.text.x = element_text(size = 5),
        strip.text.y = element_text(size = 4),
        panel.grid.major = element_blank(), 
        axis.text = element_text(size = 2.5))

print(p_pairs)

the_correlations <- rcorr(as.matrix(ds_subset_50_ave %>% 
    select(-cntry_name, -ess_year)))

diag(the_correlations$r) <- NA
print(paste("Highest correlation coefficient", round(max(abs(the_correlations$r), na.rm = TRUE),3)))
## [1] "Highest correlation coefficient 0.878"
# save space and compress correlation and p into the same matrix
CreateCorrPvalMatrix <- function(the_correlations) {
  the_correlations$r * lower.tri(the_correlations$r) +
    the_correlations$P * upper.tri(the_correlations$P)
}

knitr::kable(CreateCorrPvalMatrix(the_correlations),
             digits = 3, 
             caption = "Lower triangle: Correlation coefficients, upper triangle: p-values")
Lower triangle: Correlation coefficients, upper triangle: p-values
happy health imbgeco impcntr imueclt imwbcnt polintr pplhlp stfeco stfedu stfhlth trstep trstlgl trstplc trstun eisced2 stfgov2 euftf2
happy NA 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.003 0.000 0.000 0.000 0.467 0.000 0.000
health -0.676 NA 0.000 0.017 0.009 0.000 0.000 0.000 0.000 0.000 0.000 0.056 0.000 0.000 0.000 0.370 0.000 0.000
imbgeco 0.529 -0.288 NA 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.241 0.000 0.944
impcntr -0.398 0.175 -0.732 NA 0.000 0.000 0.000 0.000 0.000 0.005 0.003 0.066 0.024 0.000 0.000 0.341 0.001 0.155
imueclt 0.556 -0.190 0.775 -0.707 NA 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.888 0.000 0.634
imwbcnt 0.561 -0.362 0.813 -0.747 0.863 NA 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.018 0.000 0.291
polintr -0.564 0.295 -0.517 0.482 -0.531 -0.547 NA 0.000 0.000 0.000 0.000 0.255 0.000 0.000 0.000 0.000 0.000 0.041
pplhlp 0.765 -0.531 0.504 -0.360 0.497 0.545 -0.624 NA 0.000 0.000 0.000 0.082 0.000 0.000 0.000 0.000 0.000 0.000
stfeco 0.754 -0.467 0.551 -0.350 0.447 0.493 -0.542 0.718 NA 0.000 0.000 0.000 0.000 0.000 0.000 0.007 0.000 0.000
stfedu 0.734 -0.494 0.366 -0.207 0.432 0.448 -0.385 0.689 0.737 NA 0.000 0.000 0.000 0.000 0.000 0.006 0.000 0.000
stfhlth 0.790 -0.539 0.367 -0.217 0.367 0.314 -0.432 0.598 0.627 0.676 NA 0.015 0.000 0.000 0.000 0.945 0.000 0.000
trstep 0.214 -0.140 0.254 -0.134 0.280 0.274 -0.084 0.128 0.403 0.430 0.178 NA 0.000 0.000 0.000 0.899 0.000 0.000
trstlgl 0.785 -0.611 0.418 -0.165 0.397 0.401 -0.620 0.780 0.783 0.704 0.701 0.365 NA 0.000 0.000 0.266 0.000 0.000
trstplc 0.809 -0.563 0.499 -0.312 0.552 0.484 -0.506 0.747 0.688 0.717 0.687 0.330 0.878 NA 0.000 0.144 0.000 0.000
trstun 0.529 -0.254 0.499 -0.476 0.645 0.600 -0.396 0.626 0.629 0.658 0.360 0.609 0.618 0.687 NA 0.671 0.000 0.004
eisced2 -0.054 0.066 0.086 -0.070 -0.010 0.173 -0.281 0.273 0.197 0.200 -0.005 -0.009 0.082 -0.107 0.031 NA 0.027 0.128
stfgov2 0.666 -0.441 0.450 -0.238 0.357 0.398 -0.482 0.619 0.871 0.693 0.600 0.486 0.761 0.617 0.554 0.162 NA 0.000
euftf2 -0.356 0.276 -0.005 -0.104 -0.035 0.078 0.149 -0.576 -0.316 -0.357 -0.426 0.262 -0.474 -0.448 -0.210 -0.112 -0.290 NA

The model

Basically, we start with a linear model with all variables and let 2-way stepAIC in MASS library to make the selection of the final variables.

library(ggfortify)

MakeModel <- function(indata, response_var) {
  # create the model
  ds <- na.omit(indata)
  n_filtered <- nrow(indata) - nrow(ds)
  print(paste0("Linear modeling with 2-way stepAIC: Using ", nrow(ds), 
               ", removed ", n_filtered, " rows with NA"))
  # country name and year used as rownames to get informative model diagnostics text
  rownames(ds) <- paste(ds$cntry_name, ds$ess_year)
  ds <- ds %>% select(-cntry_name, -ess_year)
  # use 'response_var' as the name of the predicted / dependent variable internally
  # in the model
  colnames(ds)[colnames(ds)==response_var] <- "response_var"
  # initial model and refinement with stepAIC
  mymodel <- lm(formula = response_var~., data=ds)
  mymodel_final <- MASS::stepAIC(mymodel, direction = "both", trace = 0)
  return(mymodel_final)
}

PredictWithModel <- function(ds, mod, response_var, pred_level=0.95){
  # make prediction
  colnames(ds)[colnames(ds)==response_var] <- "response_var"
  y_pred <- predict(mod, newdata=ds, interval="predict",level = pred_level)
  return(list("y_pred"=y_pred[,1], "pred_int"=y_pred[,c(2,3)]))
}

CheckModel <- function(labels, ds, pred_val, response_var, k, txt_head=""){
  # show performance of the model
  colnames(ds)[colnames(ds)==response_var] <- "response_var"
  rsq <- r_squared(ds$response_var, pred_val$y_pred)
  rsqa <- r_squared_a(ds$response_var, pred_val$y_pred, k)
  
  # https://stackoverflow.com/questions/39811656/r-calculate-test-mse-given-a-trained-model-from-a-training-set-and-a-test-set
  MSPE <- mean((ds$response_var - pred_val$y_pred) ^ 2)
  
  print(paste0("For ", response_var))
  print(paste0("Mean squared prediction error: ", round(MSPE, 3)))
  print(paste0("                    R-squared: ", round(rsq, 3)))
  print(paste0("           Adjusted R-squared: ", round(rsqa, 3)))

  ds_result <- data.frame(country = labels,
                          true = ds$response_var,
                          predict = pred_val$y_pred,
                          lower = pred_val$pred_int[,1],
                          upper = pred_val$pred_int[,2])
  
  knitr::kable(ds_result,
               caption = paste("True and predicted", response_var, 
                               "with 95% prediction intervals"),
               digits = 3)

  # visualize the same  
  p <- ggplot(data=ds_result, aes(x=reorder(country,predict))) +
    geom_point(aes(y=predict), color="blue") +
    geom_errorbar(aes(ymin=lower, ymax=upper), color="blue", width=0.5) + 
    geom_point(aes(y=true), color="red", shape=4, size=3) +
    labs(title=paste(response_var, txt_head),
         subtitle="True scores (red), predicted with 95% prediction interval (blue) ") +
    xlab("") + ylab("") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1),
          plot.title = element_text(hjust = 0),
          plot.subtitle = element_text(hjust = 0))
  print(p)
}

GetModelCoeffs <- function(the_mod, pred_name){
  txt_title <- paste(pred_name, "linear model coefficients")
  txt_subtitle  <- "Coefficients shown in descending order of absolute value"
  the_set <- summary(the_mod)$coefficients[-1,]
  df_set <- data.frame(variable = rownames(the_set),
                       coefficient = the_set[,1],
                       std_err = the_set[,2],
                       t_val = the_set[,3],
                       p_val = the_set[,4])
  p <- ggplot(data = df_set, 
              aes(x=reorder(variable, -abs(coefficient)), 
                  y=coefficient, fill=p_val)) +
    geom_bar(stat = "identity") +
    xlab("") + ylab("") +
    labs(title=txt_title,
         subtitle=txt_subtitle) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5))
  return(list("plot"=p, "df"=df_set, "title"= txt_title, "subtitle"=txt_subtitle))
}


# 
GetModelCrossCorr <- function(ds, predict_this, factor_names) {
  p_pairs <- ggpairs(
    ds %>% 
      select(c(predict_this, as.character(factor_names))),
    lower = list(
      continuous = wrap("points", alpha = 0.3,    size=0.1)
    ),
    upper = list(
      continuous = wrap("cor", size = 1.5, alignPercent = 1)
    )) +
    theme(strip.text.x = element_text(size = 5),
          strip.text.y = element_text(size = 4),
          panel.grid.major = element_blank(), 
          axis.text = element_text(size = 2.5))
  return(p_pairs)
}

PredAndCheck <- function(predict_this, ds_train, ds_test, txt_head){
  # wrap the whole creating, predicting and illustrating into one function 
  print(paste("Predicted indicator =", predict_this))
  mod <- MakeModel(ds_train, predict_this)
  print(summary(mod))
  p <- autoplot(mod, label.size=3) + 
    labs(subtitle = predict_this)
  print(p)
  pred <- PredictWithModel(ds_test, mod, predict_this)
  CheckModel(ds_test$cntry_name, ds_test, pred, predict_this, mod$rank, txt_head)
  mod_coeffs <- GetModelCoeffs(mod, predict_this)
  pair_cc <- GetModelCrossCorr(
    ds = ds_train,
    predict_this = predict_this,
    factor_names = mod_coeffs$df$variable)
  
  return(list("model"=mod, "coeffs"=mod_coeffs, "paircc"=pair_cc))
}

ShowCoeffSummary <- function(coeffs, indicator_table) {
  print(coeffs$plot)
  knitr::kable(coeffs$df %>% 
                 arrange(desc(abs(coefficient))) %>% 
                 mutate(name = lapply(variable, 
                                      function(x) GetIndicator(as.character(x),
                                                               indicator_table)$name)), 
               row.names = FALSE, 
               caption = paste0(coeffs$title, "\n", coeffs$subtitle),
               digits = 3)
}

Scale the data

This makes interpreting the results easier since the variables have different numeric ranges, e.g. impcntrl is from 1 to 4 and trstep between zero and 10.

In all fairness this is not a big deal as the scale used by ESS is somewhat arbitrary to begin with.

We use the countries which were in 50% or more of the surveys.

ds_subset_50_ave_s <- ds_subset_50_ave %>%
  gather(indicator, value, setdiff(colnames(ds_subset_50_ave), metadata_columns)) %>%
  filter(!is.na(value)) %>%
  group_by(indicator) %>% 
  mutate(scaled_ave = scale(value)) %>%
  select(-value) %>%
  spread(indicator, scaled_ave) %>%
  ungroup()

summary(ds_subset_50_ave_s)
##     ess_year     cntry_name           eisced2             euftf2        
##  Min.   :2002   Length:187         Min.   :-2.99412   Min.   :-2.17922  
##  1st Qu.:2006   Class :character   1st Qu.:-0.45482   1st Qu.:-0.70849  
##  Median :2010   Mode  :character   Median : 0.04543   Median :-0.09494  
##  Mean   :2009                      Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:2012                      3rd Qu.: 0.49712   3rd Qu.: 0.63075  
##  Max.   :2016                      Max.   : 3.01198   Max.   : 3.31707  
##      happy             health            imbgeco        
##  Min.   :-2.7587   Min.   :-1.88898   Min.   :-2.78500  
##  1st Qu.:-0.6616   1st Qu.:-0.82145   1st Qu.:-0.60922  
##  Median : 0.1827   Median :-0.02471   Median : 0.07823  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.8078   3rd Qu.: 0.66210   3rd Qu.: 0.74118  
##  Max.   : 1.6158   Max.   : 2.68904   Max.   : 2.02665  
##     impcntr            imueclt           imwbcnt            polintr       
##  Min.   :-2.45929   Min.   :-2.6543   Min.   :-2.80817   Min.   :-2.0150  
##  1st Qu.:-0.60876   1st Qu.:-0.5037   1st Qu.:-0.61221   1st Qu.:-0.8028  
##  Median :-0.07669   Median : 0.1249   Median :-0.06608   Median : 0.1184  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.55514   3rd Qu.: 0.6258   3rd Qu.: 0.77684   3rd Qu.: 0.6015  
##  Max.   : 2.97203   Max.   : 2.1361   Max.   : 2.47328   Max.   : 2.0882  
##      pplhlp            stfeco             stfedu        
##  Min.   :-2.1816   Min.   :-2.20550   Min.   :-2.40000  
##  1st Qu.:-0.7399   1st Qu.:-0.71337   1st Qu.:-0.64911  
##  Median :-0.1158   Median :-0.02312   Median :-0.05915  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.9393   3rd Qu.: 0.62509   3rd Qu.: 0.67089  
##  Max.   : 2.4904   Max.   : 2.44404   Max.   : 2.39419  
##     stfgov2            stfhlth             trstep       
##  Min.   :-2.51979   Min.   :-2.20712   Min.   :-3.1953  
##  1st Qu.:-0.65153   1st Qu.:-0.88885   1st Qu.:-0.7404  
##  Median : 0.03174   Median : 0.08731   Median : 0.1725  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.69899   3rd Qu.: 0.79932   3rd Qu.: 0.7218  
##  Max.   : 2.23489   Max.   : 1.87377   Max.   : 2.5117  
##     trstlgl            trstplc            trstun        
##  Min.   :-2.57154   Min.   :-3.2652   Min.   :-3.09399  
##  1st Qu.:-0.70928   1st Qu.:-0.6623   1st Qu.:-0.54098  
##  Median : 0.01146   Median : 0.1501   Median :-0.07194  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.75589   3rd Qu.: 0.6462   3rd Qu.: 0.47075  
##  Max.   : 2.05895   Max.   : 1.9620   Max.   : 2.05203
test_year <- 2016

Training and testing data

We reserve the scores from 2016 for testing the model. All other data used for training.

And for testing only the countries with non imputed data for that year are used.

ds_train <- ds_subset_50_ave_s  %>% filter(ess_year < test_year)
dim(ds_train)
## [1] 165  20
# Test data - only use data we really have, not imputed stuff
test_countries <- ds_participations %>% 
  filter(last_year==test_year) %>% 
  select(cntry_name)
ds_test  <- ds_subset_50_ave_s %>% 
  filter(ess_year == test_year, 
         cntry_name %in% test_countries$cntry_name)
print(paste("Countries used as test data in",
            test_year,
            ":",
            paste0(unique(ds_test$cntry_name), collapse=", ")))
## [1] "Countries used as test data in 2016 : Austria, Belgium, Czechia, Estonia, Finland, France, Germany, Hungary, Ireland, Israel, Italy, Lithuania, Netherlands, Norway, Poland, Portugal, Russia, Slovenia, Spain, Sweden, Switzerland, United Kingdom"

Models and observations

Subjective Happiness

Investigating how the selected ESS variables are correlating with subjective happiness.

the_mod <- PredAndCheck("happy", ds_train, ds_test, "- train 2002-2014, test 2016")
## [1] "Predicted indicator = happy"
## [1] "Linear modeling with 2-way stepAIC: Using 165, removed 0 rows with NA"
## 
## Call:
## lm(formula = response_var ~ eisced2 + euftf2 + health + imbgeco + 
##     impcntr + imueclt + pplhlp + stfeco + stfedu + stfhlth + 
##     trstlgl + trstun, data = ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91964 -0.17910  0.00755  0.16397  0.87223 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.004687   0.026422  -0.177 0.859444    
## eisced2     -0.206142   0.031288  -6.589 6.85e-10 ***
## euftf2       0.091158   0.038312   2.379 0.018583 *  
## health      -0.129472   0.039452  -3.282 0.001280 ** 
## imbgeco     -0.130302   0.052480  -2.483 0.014118 *  
## impcntr     -0.153648   0.049034  -3.134 0.002073 ** 
## imueclt      0.193826   0.053345   3.633 0.000382 ***
## pplhlp       0.374918   0.060544   6.193 5.28e-09 ***
## stfeco       0.200033   0.053456   3.742 0.000258 ***
## stfedu       0.207086   0.054984   3.766 0.000236 ***
## stfhlth      0.212544   0.047699   4.456 1.61e-05 ***
## trstlgl      0.174122   0.063429   2.745 0.006778 ** 
## trstun      -0.295670   0.057682  -5.126 8.88e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3365 on 152 degrees of freedom
## Multiple R-squared:  0.8993, Adjusted R-squared:  0.8914 
## F-statistic: 113.2 on 12 and 152 DF,  p-value: < 2.2e-16

## [1] "For happy"
## [1] "Mean squared prediction error: 0.138"
## [1] "                    R-squared: 0.757"
## [1] "           Adjusted R-squared: 0.361"

ShowCoeffSummary(the_mod$coeffs, indicator_table)

happy linear model coefficients Coefficients shown in descending order of absolute value
variable coefficient std_err t_val p_val name
pplhlp 0.375 0.061 6.193 0.000 Most of the time people helpful or mostly looking out for themselves
trstun -0.296 0.058 -5.126 0.000 Trust in the United Nations
stfhlth 0.213 0.048 4.456 0.000 State of health services in country nowadays
stfedu 0.207 0.055 3.766 0.000 State of education in country nowadays
eisced2 -0.206 0.031 -6.589 0.000 Highest level of education, ES ISCED
stfeco 0.200 0.053 3.742 0.000 How satisfied with present state of economy in country
imueclt 0.194 0.053 3.633 0.000 Country’s cultural life undermined or enriched by immigrants
trstlgl 0.174 0.063 2.745 0.007 Trust in the legal system
impcntr -0.154 0.049 -3.134 0.002 Allow many/few immigrants from poorer countries outside Europe
imbgeco -0.130 0.052 -2.483 0.014 Immigration bad or good for country’s economy
health -0.129 0.039 -3.282 0.001 Subjective general health
euftf2 0.091 0.038 2.379 0.019 European Union: European unification go further or gone too far
print(the_mod$paircc)

the_mod_sum <- summary(the_mod$model)

The model explains 89.1% of variation.

Note that for impcntr Allow many/few immigrants from poorer countries outside Europe lower values of this indicator mean allowing more more immigration and higher values less. Favoring more immigration is associated with higher happiness and “closing the borders” associated with lower level of happiness

Helpful People

ESS results

ess_set <- GetObservationsSet(
  ds = ds_ss_avg,
  ds_obs = ds_subset_50,
  the_indicator = "pplhlp",
  indicator_table = indicator_table,
  cntry_list = cntry_in_100pct)
## Joining, by = "cntry_name"
print(ess_set$summary)
##     ess_year    cntry_name         indicator             value    
##  Min.   : NA   Length:0           Length:0           Min.   : NA  
##  1st Qu.: NA   Class :character   Class :character   1st Qu.: NA  
##  Median : NA   Mode  :character   Mode  :character   Median : NA  
##  Mean   :NaN                                         Mean   :NaN  
##  3rd Qu.: NA                                         3rd Qu.: NA  
##  Max.   : NA                                         Max.   : NA
MyPrintInteractive(ess_set$map, plot_interactive) 
MyPrintInteractive(ess_set$heat, plot_interactive) 
MyPrintInteractive(ess_set$sig$plot, plot_interactive) 
tab.3_cap <- table_nums(name="tab_3",
                        caption = paste0(ess_set$indicator$name, "2002-2016, significance and effect size of the change"))

Tab. 3: Most of the time people helpful or mostly looking out for themselves2002-2016, significance and effect size of the change

knitr::kable(ess_set$sig$table, digits = 3, 
             col.names = gsub("_", " ", colnames(ess_set$sig$table)))
cntry name 2002 2014 2016 d0216 eff cil 0216 eff ciu 0216 eff m 0216 p 0216 sig 0216 d1416 eff cil 1416 eff ciu 1416 eff m 1416 p 1416 sig 1416
Spain 4.406 4.376 4.381 -0.025 0.008 0.012 negligible 0.740 0.005 0.008 0.012 negligible 0.958
Sweden 6.012 6.042 6.025 0.012 0.061 0.065 negligible 0.904 -0.017 0.061 0.065 negligible 0.798
Ireland 5.879 5.861 6.046 0.167 -0.047 -0.043 negligible 0.010 ** 0.185 -0.047 -0.043 negligible 0.004 **
Portugal 3.937 4.011 4.192 0.255 -0.057 -0.053 negligible 0.004 ** 0.181 -0.057 -0.053 negligible 0.040 *
Belgium 4.392 4.637 4.683 0.291 -0.107 -0.103 negligible 0.000 *** 0.046 -0.107 -0.103 negligible 0.514
United Kingdom 5.408 5.849 5.754 0.346 0.242 0.246 small 0.000 *** -0.094 0.242 0.246 small 0.136
Finland 5.676 5.933 6.110 0.434 -0.191 -0.187 negligible 0.000 *** 0.178 -0.191 -0.187 negligible 0.002 **
France 4.345 4.667 4.807 0.462 -0.106 -0.102 negligible 0.000 *** 0.140 -0.106 -0.102 negligible 0.026 *
Switzerland 5.293 5.684 5.759 0.466 -0.167 -0.163 negligible 0.000 *** 0.075 -0.167 -0.163 negligible 0.306
Hungary 4.147 4.368 4.616 0.469 -0.085 -0.081 negligible 0.000 *** 0.248 -0.085 -0.081 negligible 0.000 ***
Netherlands 5.236 5.593 5.706 0.470 -0.132 -0.128 negligible 0.000 *** 0.114 -0.132 -0.128 negligible 0.044 *
Germany 4.801 5.240 5.424 0.623 -0.130 -0.127 negligible 0.000 *** 0.184 -0.130 -0.127 negligible 0.000 ***
Poland 3.163 3.664 3.830 0.666 -0.277 -0.273 small 0.000 *** 0.165 -0.277 -0.273 small 0.040 *
Slovenia 4.233 4.967 5.167 0.934 -0.362 -0.357 small 0.000 *** 0.200 -0.362 -0.357 small 0.026 *
Norway 6.013 6.048 6.984 0.971 -0.307 -0.303 small 0.000 *** 0.936 -0.307 -0.303 small 0.000 ***

Modeling

Investigating the highest factor in happy model - and excluding happy from the model as I do not think it can be an independent variable at all.

the_mod <- PredAndCheck("pplhlp", 
                        ds_train %>% select(-happy), 
                        ds_test%>% select(-happy), 
                        "- train 2002-2014, test 2016")
## [1] "Predicted indicator = pplhlp"
## [1] "Linear modeling with 2-way stepAIC: Using 165, removed 0 rows with NA"
## 
## Call:
## lm(formula = response_var ~ eisced2 + euftf2 + health + impcntr + 
##     imueclt + imwbcnt + polintr + stfeco + trstep + trstplc + 
##     trstun, data = ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.02574 -0.23929 -0.01781  0.22439  1.12532 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.000608   0.031304   0.019 0.984529    
## eisced2      0.106774   0.041411   2.578 0.010868 *  
## euftf2      -0.319470   0.054215  -5.893 2.34e-08 ***
## health      -0.090918   0.049221  -1.847 0.066658 .  
## impcntr      0.076419   0.051308   1.489 0.138438    
## imueclt     -0.173397   0.084618  -2.049 0.042153 *  
## imwbcnt      0.316904   0.092203   3.437 0.000758 ***
## polintr     -0.191909   0.045364  -4.230 4.00e-05 ***
## stfeco       0.146637   0.051003   2.875 0.004615 ** 
## trstep      -0.150051   0.053838  -2.787 0.005992 ** 
## trstplc      0.173614   0.070795   2.452 0.015318 *  
## trstun       0.298725   0.067928   4.398 2.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3949 on 153 degrees of freedom
## Multiple R-squared:  0.8554, Adjusted R-squared:  0.845 
## F-statistic: 82.25 on 11 and 153 DF,  p-value: < 2.2e-16

## [1] "For pplhlp"
## [1] "Mean squared prediction error: 0.249"
## [1] "                    R-squared: 0.681"
## [1] "           Adjusted R-squared: 0.255"

ShowCoeffSummary(the_mod$coeffs, indicator_table)

pplhlp linear model coefficients Coefficients shown in descending order of absolute value
variable coefficient std_err t_val p_val name
euftf2 -0.319 0.054 -5.893 0.000 European Union: European unification go further or gone too far
imwbcnt 0.317 0.092 3.437 0.001 Immigrants make country worse or better place to live
trstun 0.299 0.068 4.398 0.000 Trust in the United Nations
polintr -0.192 0.045 -4.230 0.000 How interested in politics
trstplc 0.174 0.071 2.452 0.015 Trust in the police
imueclt -0.173 0.085 -2.049 0.042 Country’s cultural life undermined or enriched by immigrants
trstep -0.150 0.054 -2.787 0.006 Trust in the European Parliament
stfeco 0.147 0.051 2.875 0.005 How satisfied with present state of economy in country
eisced2 0.107 0.041 2.578 0.011 Highest level of education, ES ISCED
health -0.091 0.049 -1.847 0.067 Subjective general health
impcntr 0.076 0.051 1.489 0.138 Allow many/few immigrants from poorer countries outside Europe
print(the_mod$paircc)

the_mod_sum <- summary(the_mod$model)

The model explains 84.5% of variation.

Looks like here happy and trstun are major positive factors. trstun was seen as a major factor in happy as well. The third biggest positive factor is imwbcnt Immigrants make country worse or better place to live.

Interestingly, this is negatively correlated with the euftf European Union: European unification go further or gone too far. And there is a negative correlation with imueclt Country’s cultural life undermined or enriched by immigrants.

Trust in United Nations

Trust in the UN seems to be an important factor, so let’s drill deeper.

ESS results

ess_set <- GetObservationsSet(
  ds = ds_ss_avg,
  ds_obs = ds_subset_50,
  the_indicator = "trstun",
  indicator_table = indicator_table,
  cntry_list = cntry_in_100pct)
## Joining, by = "cntry_name"
print(ess_set$summary)
##     ess_year    cntry_name         indicator             value    
##  Min.   : NA   Length:0           Length:0           Min.   : NA  
##  1st Qu.: NA   Class :character   Class :character   1st Qu.: NA  
##  Median : NA   Mode  :character   Mode  :character   Median : NA  
##  Mean   :NaN                                         Mean   :NaN  
##  3rd Qu.: NA                                         3rd Qu.: NA  
##  Max.   : NA                                         Max.   : NA
MyPrintInteractive(ess_set$map, plot_interactive) 
MyPrintInteractive(ess_set$heat, plot_interactive) 
MyPrintInteractive(ess_set$sig$plot, plot_interactive) 
tab.4_cap <- table_nums(name="tab_4",
                        caption = paste0(ess_set$indicator$name, "2002-2016, significance and effect size of the change"))

Tab. 4: Trust in the United Nations2002-2016, significance and effect size of the change

knitr::kable(ess_set$sig$table, digits = 3, 
             col.names = gsub("_", " ", colnames(ess_set$sig$table)))
cntry name 2002 2014 2016 d0216 eff cil 0216 eff ciu 0216 eff m 0216 p 0216 sig 0216 d1416 eff cil 1416 eff ciu 1416 eff m 1416 p 1416 sig 1416
Hungary 5.972 5.369 4.934 -1.038 0.422 0.426 small 0.000 *** -0.435 0.422 0.426 small 0.000 ***
Poland 5.634 4.604 4.909 -0.725 0.247 0.252 small 0.000 *** 0.305 0.247 0.252 small 0.000 ***
Slovenia 4.876 4.137 4.337 -0.539 0.186 0.191 negligible 0.000 *** 0.200 0.186 0.191 negligible 0.044 *
Sweden 6.603 6.411 6.242 -0.361 0.159 0.164 negligible 0.000 *** -0.169 0.159 0.164 negligible 0.024 *
Germany 5.170 4.680 4.812 -0.358 0.125 0.128 negligible 0.000 *** 0.132 0.125 0.128 negligible 0.030 *
Switzerland 5.440 5.363 5.267 -0.173 0.042 0.046 negligible 0.032 * -0.096 0.042 0.046 negligible 0.288
Finland 6.462 6.148 6.353 -0.109 0.047 0.051 negligible 0.088 + 0.205 0.047 0.051 negligible 0.002 **
Portugal 5.356 4.815 5.356 0.000 0.013 0.018 negligible 0.992 0.541 0.013 0.018 negligible 0.000 ***
Ireland 5.713 5.445 5.716 0.003 -0.017 -0.013 negligible 0.996 0.271 -0.017 -0.013 negligible 0.000 ***
United Kingdom 5.317 4.909 5.360 0.044 0.282 0.287 small 0.598 0.452 0.282 0.287 small 0.000 ***
Norway 6.756 6.723 6.848 0.092 -0.032 -0.028 negligible 0.150 0.125 -0.032 -0.028 negligible 0.126
Spain 4.661 4.817 4.771 0.109 -0.014 -0.010 negligible 0.196 -0.047 -0.014 -0.010 negligible 0.580
Belgium 4.997 5.199 5.230 0.233 -0.099 -0.095 negligible 0.006 ** 0.031 -0.099 -0.095 negligible 0.658
Netherlands 5.470 5.534 5.732 0.262 -0.064 -0.059 negligible 0.000 *** 0.198 -0.064 -0.059 negligible 0.002 **
France 4.569 5.100 4.921 0.352 -0.074 -0.070 negligible 0.000 *** -0.179 -0.074 -0.070 negligible 0.022 *

Modeling

the_mod <- PredAndCheck("trstun", 
                        ds_train %>% select(-happy), 
                        ds_test%>% select(-happy), 
                        "- train 2002-2014, test 2016")
## [1] "Predicted indicator = trstun"
## [1] "Linear modeling with 2-way stepAIC: Using 165, removed 0 rows with NA"
## Warning: Setting row names on a tibble is deprecated.
## 
## Call:
## lm(formula = response_var ~ eisced2 + euftf2 + health + imbgeco + 
##     impcntr + imueclt + imwbcnt + polintr + pplhlp + stfeco + 
##     stfedu + stfgov2 + stfhlth + trstep + trstlgl, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6998 -0.2541 -0.0195  0.2020  1.0600 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02489    0.02874   0.866 0.387877    
## eisced2     -0.12943    0.03747  -3.454 0.000719 ***
## euftf2      -0.17519    0.05724  -3.061 0.002619 ** 
## health       0.23682    0.04601   5.147 8.21e-07 ***
## imbgeco     -0.24459    0.05818  -4.204 4.50e-05 ***
## impcntr     -0.33549    0.05301  -6.329 2.74e-09 ***
## imueclt      0.11603    0.08009   1.449 0.149503    
## imwbcnt      0.18293    0.09200   1.988 0.048597 *  
## polintr      0.11506    0.05341   2.154 0.032823 *  
## pplhlp       0.28953    0.07106   4.075 7.47e-05 ***
## stfeco       0.17257    0.07225   2.389 0.018169 *  
## stfedu       0.24172    0.05710   4.234 4.00e-05 ***
## stfgov2     -0.21773    0.06650  -3.274 0.001319 ** 
## stfhlth     -0.25712    0.04810  -5.346 3.32e-07 ***
## trstep       0.41812    0.04881   8.566 1.24e-14 ***
## trstlgl      0.36318    0.08104   4.482 1.47e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.366 on 149 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.8619 
## F-statistic: 69.23 on 15 and 149 DF,  p-value: < 2.2e-16

## [1] "For trstun"
## [1] "Mean squared prediction error: 0.239"
## [1] "                    R-squared: 0.809"
## [1] "           Adjusted R-squared: 0.2"

ShowCoeffSummary(the_mod$coeffs, indicator_table)

trstun linear model coefficients Coefficients shown in descending order of absolute value
variable coefficient std_err t_val p_val name
trstep 0.418 0.049 8.566 0.000 Trust in the European Parliament
trstlgl 0.363 0.081 4.482 0.000 Trust in the legal system
impcntr -0.335 0.053 -6.329 0.000 Allow many/few immigrants from poorer countries outside Europe
pplhlp 0.290 0.071 4.075 0.000 Most of the time people helpful or mostly looking out for themselves
stfhlth -0.257 0.048 -5.346 0.000 State of health services in country nowadays
imbgeco -0.245 0.058 -4.204 0.000 Immigration bad or good for country’s economy
stfedu 0.242 0.057 4.234 0.000 State of education in country nowadays
health 0.237 0.046 5.147 0.000 Subjective general health
stfgov2 -0.218 0.067 -3.274 0.001 How satisfied with the national government
imwbcnt 0.183 0.092 1.988 0.049 Immigrants make country worse or better place to live
euftf2 -0.175 0.057 -3.061 0.003 European Union: European unification go further or gone too far
stfeco 0.173 0.072 2.389 0.018 How satisfied with present state of economy in country
eisced2 -0.129 0.037 -3.454 0.001 Highest level of education, ES ISCED
imueclt 0.116 0.080 1.449 0.150 Country’s cultural life undermined or enriched by immigrants
polintr 0.115 0.053 2.154 0.033 How interested in politics
print(the_mod$paircc)

the_mod_sum <- summary(the_mod$model)

The model explains 86.2% of variation.

Trust in European Parliament

EES Results

ess_set <- GetObservationsSet(
  ds = ds_ss_avg,
  ds_obs = ds_subset_50,
  the_indicator = "trstep",
  indicator_table = indicator_table,
  cntry_list = cntry_in_100pct)
## Joining, by = "cntry_name"
print(ess_set$summary)
##     ess_year    cntry_name         indicator             value    
##  Min.   : NA   Length:0           Length:0           Min.   : NA  
##  1st Qu.: NA   Class :character   Class :character   1st Qu.: NA  
##  Median : NA   Mode  :character   Mode  :character   Median : NA  
##  Mean   :NaN                                         Mean   :NaN  
##  3rd Qu.: NA                                         3rd Qu.: NA  
##  Max.   : NA                                         Max.   : NA
MyPrintInteractive(ess_set$map, plot_interactive) 
MyPrintInteractive(ess_set$heat, plot_interactive) 
MyPrintInteractive(ess_set$sig$plot, plot_interactive) 
tab.5_cap <- table_nums(name="tab_5",
                        caption = paste0(ess_set$indicator$name, "2002-2016, significance and effect size of the change"))

Tab. 5: Trust in the European Parliament2002-2016, significance and effect size of the change

knitr::kable(ess_set$sig$table, digits = 3, 
             col.names = gsub("_", " ", colnames(ess_set$sig$table)))
cntry name 2002 2014 2016 d0216 eff cil 0216 eff ciu 0216 eff m 0216 p 0216 sig 0216 d1416 eff cil 1416 eff ciu 1416 eff m 1416 p 1416 sig 1416
Hungary 5.625 4.867 4.367 -1.258 0.523 0.528 medium 0.000 *** -0.501 0.523 0.528 medium 0.000 ***
Poland 4.741 3.665 3.927 -0.815 0.308 0.312 small 0.000 *** 0.262 0.308 0.312 small 0.006 **
Portugal 4.835 3.501 4.103 -0.733 0.145 0.150 negligible 0.000 *** 0.601 0.145 0.150 negligible 0.000 ***
Slovenia 4.620 3.389 3.944 -0.676 0.248 0.253 small 0.000 *** 0.555 0.248 0.253 small 0.000 ***
France 4.345 3.948 3.809 -0.536 0.130 0.134 negligible 0.000 *** -0.140 0.130 0.134 negligible 0.060 +
Spain 4.767 3.886 4.245 -0.522 0.177 0.182 negligible 0.000 *** 0.359 0.177 0.182 negligible 0.000 ***
Switzerland 4.812 4.525 4.501 -0.311 0.084 0.088 negligible 0.000 *** -0.024 0.084 0.088 negligible 0.818
Belgium 4.802 4.808 4.587 -0.215 0.043 0.047 negligible 0.008 ** -0.221 0.043 0.047 negligible 0.004 **
Ireland 5.132 4.626 5.004 -0.128 0.022 0.026 negligible 0.080 + 0.378 0.022 0.026 negligible 0.000 ***
Germany 4.526 4.079 4.400 -0.126 0.060 0.064 negligible 0.032 * 0.321 0.060 0.064 negligible 0.000 ***
Netherlands 4.777 4.487 4.675 -0.101 0.033 0.038 negligible 0.118 0.188 0.033 0.038 negligible 0.012 *
United Kingdom 3.692 3.270 3.776 0.084 0.201 0.205 small 0.288 0.507 0.201 0.205 small 0.000 ***
Finland 4.878 4.673 5.173 0.295 -0.115 -0.111 negligible 0.000 *** 0.501 -0.115 -0.111 negligible 0.000 ***
Norway 4.706 5.002 5.151 0.444 -0.143 -0.138 negligible 0.000 *** 0.148 -0.143 -0.138 negligible 0.056 +
Sweden 4.084 4.765 4.841 0.757 -0.126 -0.121 negligible 0.000 *** 0.076 -0.126 -0.121 negligible 0.362

As can be seen above, the big positive changes in UK and Portugal as well as Finland and Slovenia from 2014 to 2016 were significant as per weighted t-test.

And the same applies to quite large drops in Belgium, France and Hungary.

With more investigation one might find some interesting correlations between those changes and other indicators in ESS data and elsewhere. But that adventure is beyond the scope of this post.

Modeling

the_mod <- PredAndCheck("trstep", 
                        ds_train %>% select(-happy), 
                        ds_test%>% select(-happy), 
                        "- train 2002-2014, test 2016")
## [1] "Predicted indicator = trstep"
## [1] "Linear modeling with 2-way stepAIC: Using 165, removed 0 rows with NA"
## Warning: Setting row names on a tibble is deprecated.
## 
## Call:
## lm(formula = response_var ~ eisced2 + euftf2 + health + imbgeco + 
##     impcntr + imwbcnt + polintr + pplhlp + stfeco + stfgov2 + 
##     stfhlth + trstun, data = ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27108 -0.36267  0.08103  0.28691  1.20172 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.003374   0.039423  -0.086 0.931916    
## eisced2      0.115316   0.046252   2.493 0.013731 *  
## euftf2       0.494093   0.063408   7.792 9.61e-13 ***
## health      -0.132712   0.053542  -2.479 0.014280 *  
## imbgeco      0.240707   0.079032   3.046 0.002737 ** 
## impcntr      0.242026   0.066604   3.634 0.000381 ***
## imwbcnt     -0.281397   0.091003  -3.092 0.002365 ** 
## polintr      0.106323   0.057925   1.836 0.068383 .  
## pplhlp      -0.298004   0.097896  -3.044 0.002751 ** 
## stfeco      -0.250087   0.096764  -2.585 0.010693 *  
## stfgov2      0.510377   0.082850   6.160 6.21e-09 ***
## stfhlth      0.123759   0.059742   2.072 0.039996 *  
## trstun       0.872817   0.070352  12.406  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5025 on 152 degrees of freedom
## Multiple R-squared:  0.7582, Adjusted R-squared:  0.7391 
## F-statistic: 39.71 on 12 and 152 DF,  p-value: < 2.2e-16

## [1] "For trstep"
## [1] "Mean squared prediction error: 0.354"
## [1] "                    R-squared: 0.69"
## [1] "           Adjusted R-squared: 0.187"

ShowCoeffSummary(the_mod$coeffs, indicator_table)

trstep linear model coefficients Coefficients shown in descending order of absolute value
variable coefficient std_err t_val p_val name
trstun 0.873 0.070 12.406 0.000 Trust in the United Nations
stfgov2 0.510 0.083 6.160 0.000 How satisfied with the national government
euftf2 0.494 0.063 7.792 0.000 European Union: European unification go further or gone too far
pplhlp -0.298 0.098 -3.044 0.003 Most of the time people helpful or mostly looking out for themselves
imwbcnt -0.281 0.091 -3.092 0.002 Immigrants make country worse or better place to live
stfeco -0.250 0.097 -2.585 0.011 How satisfied with present state of economy in country
impcntr 0.242 0.067 3.634 0.000 Allow many/few immigrants from poorer countries outside Europe
imbgeco 0.241 0.079 3.046 0.003 Immigration bad or good for country’s economy
health -0.133 0.054 -2.479 0.014 Subjective general health
stfhlth 0.124 0.060 2.072 0.040 State of health services in country nowadays
eisced2 0.115 0.046 2.493 0.014 Highest level of education, ES ISCED
polintr 0.106 0.058 1.836 0.068 How interested in politics
print(the_mod$paircc)

the_mod_sum <- summary(the_mod$model)

The model explains 73.9% of variation.

Immigrants from poorer countries outside Europe

EES Results

ess_set <- GetObservationsSet(
  ds = ds_ss_avg,
  ds_obs = ds_subset_50,
  the_indicator = "impcntr",
  indicator_table = indicator_table,
  cntry_list = cntry_in_100pct)
## Joining, by = "cntry_name"
print(ess_set$summary)
##     ess_year    cntry_name         indicator             value    
##  Min.   : NA   Length:0           Length:0           Min.   : NA  
##  1st Qu.: NA   Class :character   Class :character   1st Qu.: NA  
##  Median : NA   Mode  :character   Mode  :character   Median : NA  
##  Mean   :NaN                                         Mean   :NaN  
##  3rd Qu.: NA                                         3rd Qu.: NA  
##  Max.   : NA                                         Max.   : NA
MyPrintInteractive(ess_set$map, plot_interactive) 
MyPrintInteractive(ess_set$heat, plot_interactive) 
MyPrintInteractive(ess_set$sig$plot, plot_interactive) 
tab.6_cap <- table_nums(name="tab_6",
                        caption = paste0(ess_set$indicator$name, "2002-2016, significance and effect size of the change"))

Tab. 6: Allow many/few immigrants from poorer countries outside Europe2002-2016, significance and effect size of the change

knitr::kable(ess_set$sig$table, digits = 3, 
             col.names = gsub("_", " ", colnames(ess_set$sig$table)))
cntry name 2002 2014 2016 d0216 eff cil 0216 eff ciu 0216 eff m 0216 p 0216 sig 0216 d1416 eff cil 1416 eff ciu 1416 eff m 1416 p 1416 sig 1416
Portugal 2.791 2.607 2.220 -0.571 0.231 0.236 small 0.000 *** -0.387 0.231 0.236 small 0.000 ***
Spain 2.402 2.379 2.112 -0.291 0.212 0.217 small 0.000 *** -0.268 0.212 0.217 small 0.000 ***
United Kingdom 2.575 2.685 2.304 -0.271 0.525 0.529 medium 0.000 *** -0.381 0.525 0.529 medium 0.000 ***
Norway 2.304 2.181 2.036 -0.268 0.233 0.237 small 0.000 *** -0.144 0.233 0.237 small 0.000 ***
Germany 2.390 2.233 2.162 -0.228 0.171 0.174 negligible 0.000 *** -0.071 0.171 0.174 negligible 0.002 **
France 2.559 2.594 2.357 -0.202 0.121 0.125 negligible 0.000 *** -0.236 0.121 0.125 negligible 0.000 ***
Belgium 2.482 2.580 2.293 -0.189 0.136 0.140 negligible 0.000 *** -0.286 0.136 0.140 negligible 0.000 ***
Netherlands 2.467 2.505 2.323 -0.144 0.095 0.099 negligible 0.000 *** -0.182 0.095 0.099 negligible 0.000 ***
Finland 2.603 2.697 2.511 -0.092 0.093 0.097 negligible 0.000 *** -0.186 0.093 0.097 negligible 0.000 ***
Sweden 1.879 1.763 1.802 -0.077 0.120 0.124 negligible 0.004 ** 0.039 0.120 0.124 negligible 0.124
Slovenia 2.468 2.531 2.490 0.022 -0.025 -0.020 negligible 0.502 -0.041 -0.025 -0.020 negligible 0.246
Ireland 2.297 2.657 2.332 0.035 -0.027 -0.023 negligible 0.136 -0.325 -0.027 -0.023 negligible 0.000 ***
Switzerland 2.218 2.400 2.283 0.065 -0.060 -0.055 negligible 0.016 * -0.117 -0.060 -0.055 negligible 0.000 ***
Poland 2.409 2.455 2.591 0.182 -0.218 -0.213 small 0.000 *** 0.135 -0.218 -0.213 small 0.000 ***
Hungary 3.101 3.318 3.558 0.457 -0.210 -0.205 small 0.000 *** 0.240 -0.210 -0.205 small 0.000 ***

Modeling

the_mod <- PredAndCheck("impcntr", 
                        ds_train %>% select(-happy), 
                        ds_test%>% select(-happy), 
                        "- train 2002-2014, test 2016")
## [1] "Predicted indicator = impcntr"
## [1] "Linear modeling with 2-way stepAIC: Using 165, removed 0 rows with NA"
## 
## Call:
## lm(formula = response_var ~ eisced2 + health + imbgeco + imwbcnt + 
##     polintr + pplhlp + stfedu + stfgov2 + stfhlth + trstep + 
##     trstlgl + trstplc + trstun, data = ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12877 -0.33126 -0.06624  0.31672  1.76845 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01938    0.03862   0.502 0.616504    
## eisced2     -0.07870    0.05335  -1.475 0.142246    
## health       0.23705    0.05823   4.071 7.53e-05 ***
## imbgeco     -0.33030    0.07333  -4.504 1.33e-05 ***
## imwbcnt     -0.19156    0.08198  -2.337 0.020768 *  
## polintr      0.35827    0.06210   5.770 4.35e-08 ***
## pplhlp       0.17963    0.09140   1.965 0.051217 .  
## stfedu       0.31052    0.08033   3.866 0.000164 ***
## stfgov2     -0.22599    0.07630  -2.962 0.003552 ** 
## stfhlth     -0.18458    0.06570  -2.809 0.005621 ** 
## trstep       0.15721    0.06518   2.412 0.017065 *  
## trstlgl      0.97850    0.13509   7.243 2.09e-11 ***
## trstplc     -0.33134    0.12547  -2.641 0.009141 ** 
## trstun      -0.58533    0.09133  -6.409 1.77e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.489 on 151 degrees of freedom
## Multiple R-squared:  0.7648, Adjusted R-squared:  0.7445 
## F-statistic: 37.76 on 13 and 151 DF,  p-value: < 2.2e-16

## [1] "For impcntr"
## [1] "Mean squared prediction error: 0.327"
## [1] "                    R-squared: 0.778"
## [1] "           Adjusted R-squared: 0.333"

ShowCoeffSummary(the_mod$coeffs, indicator_table)

impcntr linear model coefficients Coefficients shown in descending order of absolute value
variable coefficient std_err t_val p_val name
trstlgl 0.979 0.135 7.243 0.000 Trust in the legal system
trstun -0.585 0.091 -6.409 0.000 Trust in the United Nations
polintr 0.358 0.062 5.770 0.000 How interested in politics
trstplc -0.331 0.125 -2.641 0.009 Trust in the police
imbgeco -0.330 0.073 -4.504 0.000 Immigration bad or good for country’s economy
stfedu 0.311 0.080 3.866 0.000 State of education in country nowadays
health 0.237 0.058 4.071 0.000 Subjective general health
stfgov2 -0.226 0.076 -2.962 0.004 How satisfied with the national government
imwbcnt -0.192 0.082 -2.337 0.021 Immigrants make country worse or better place to live
stfhlth -0.185 0.066 -2.809 0.006 State of health services in country nowadays
pplhlp 0.180 0.091 1.965 0.051 Most of the time people helpful or mostly looking out for themselves
trstep 0.157 0.065 2.412 0.017 Trust in the European Parliament
eisced2 -0.079 0.053 -1.475 0.142 Highest level of education, ES ISCED
print(the_mod$paircc)

the_mod_sum <- summary(the_mod$model)

The model explains 74.5% of variation.

Clustering the countries

We investigate if there are any clusters of countries in the data using as clusterig elements observations identified by country name and year of ess survey where all selected ess indicators

Preprocess

# create subset with ESS indicators only
drop_cols <- which(colnames(ds_ss_avg_cntry_ind) %in% c((indicator_table %>% 
  filter(source %in% c("WB", "UNHCR")))$indicator))
mydata <- ds_ss_avg_cntry_ind[,-drop_cols]
dim(mydata)
## [1] 420  20
summary(mydata)
##     ess_year     cntry_name           eisced2          euftf2     
##  Min.   :2002   Length:420         Min.   :2.272   Min.   :3.818  
##  1st Qu.:2005   Class :character   1st Qu.:3.440   1st Qu.:4.758  
##  Median :2009   Mode  :character   Median :3.670   Median :5.151  
##  Mean   :2009                      Mean   :3.649   Mean   :5.211  
##  3rd Qu.:2013                      3rd Qu.:3.878   3rd Qu.:5.615  
##  Max.   :2016                      Max.   :5.034   Max.   :7.332  
##                                    NA's   :233     NA's   :233    
##      happy           health         imbgeco         impcntr     
##  Min.   :5.405   Min.   :1.685   Min.   :3.017   Min.   :1.763  
##  1st Qu.:6.825   1st Qu.:1.972   1st Qu.:4.443   1st Qu.:2.374  
##  Median :7.396   Median :2.186   Median :4.893   Median :2.550  
##  Mean   :7.272   Mean   :2.193   Mean   :4.842   Mean   :2.576  
##  3rd Qu.:7.819   3rd Qu.:2.370   3rd Qu.:5.328   3rd Qu.:2.759  
##  Max.   :8.366   Max.   :2.915   Max.   :6.170   Max.   :3.558  
##  NA's   :233     NA's   :233     NA's   :233     NA's   :233    
##     imueclt         imwbcnt         polintr          pplhlp     
##  Min.   :3.150   Min.   :2.814   Min.   :2.158   Min.   :2.998  
##  1st Qu.:5.023   1st Qu.:4.395   1st Qu.:2.451   1st Qu.:4.228  
##  Median :5.571   Median :4.788   Median :2.673   Median :4.760  
##  Mean   :5.462   Mean   :4.836   Mean   :2.644   Mean   :4.859  
##  3rd Qu.:6.007   3rd Qu.:5.395   3rd Qu.:2.790   3rd Qu.:5.660  
##  Max.   :7.322   Max.   :6.617   Max.   :3.148   Max.   :6.984  
##  NA's   :233     NA's   :233     NA's   :233     NA's   :233    
##      stfeco          stfedu         stfgov2         stfhlth     
##  Min.   :1.368   Min.   :3.126   Min.   :1.553   Min.   :2.482  
##  1st Qu.:3.476   1st Qu.:4.917   1st Qu.:3.528   1st Qu.:4.157  
##  Median :4.452   Median :5.520   Median :4.251   Median :5.397  
##  Mean   :4.484   Mean   :5.581   Mean   :4.217   Mean   :5.286  
##  3rd Qu.:5.367   3rd Qu.:6.267   3rd Qu.:4.956   3rd Qu.:6.302  
##  Max.   :7.937   Max.   :8.029   Max.   :6.580   Max.   :7.667  
##  NA's   :233     NA's   :233     NA's   :233     NA's   :233    
##      trstep         trstlgl         trstplc          trstun     
##  Min.   :2.587   Min.   :1.898   Min.   :2.035   Min.   :2.596  
##  1st Qu.:4.039   1st Qu.:4.189   1st Qu.:5.104   1st Qu.:4.711  
##  Median :4.579   Median :5.076   Median :6.062   Median :5.100  
##  Mean   :4.477   Mean   :5.062   Mean   :5.885   Mean   :5.159  
##  3rd Qu.:4.904   3rd Qu.:5.992   3rd Qu.:6.647   3rd Qu.:5.549  
##  Max.   :5.963   Max.   :7.596   Max.   :8.199   Max.   :6.859  
##  NA's   :233     NA's   :233     NA's   :233     NA's   :233
# remove records with NAs
myDataClean = na.omit(mydata)
dim(myDataClean)
## [1] 187  20
summary(myDataClean)
##     ess_year     cntry_name           eisced2          euftf2     
##  Min.   :2002   Length:187         Min.   :2.272   Min.   :3.818  
##  1st Qu.:2006   Class :character   1st Qu.:3.440   1st Qu.:4.758  
##  Median :2010   Mode  :character   Median :3.670   Median :5.151  
##  Mean   :2009                      Mean   :3.649   Mean   :5.211  
##  3rd Qu.:2012                      3rd Qu.:3.878   3rd Qu.:5.615  
##  Max.   :2016                      Max.   :5.034   Max.   :7.332  
##      happy           health         imbgeco         impcntr     
##  Min.   :5.405   Min.   :1.685   Min.   :3.017   Min.   :1.763  
##  1st Qu.:6.825   1st Qu.:1.972   1st Qu.:4.443   1st Qu.:2.374  
##  Median :7.396   Median :2.186   Median :4.893   Median :2.550  
##  Mean   :7.272   Mean   :2.193   Mean   :4.842   Mean   :2.576  
##  3rd Qu.:7.819   3rd Qu.:2.370   3rd Qu.:5.328   3rd Qu.:2.759  
##  Max.   :8.366   Max.   :2.915   Max.   :6.170   Max.   :3.558  
##     imueclt         imwbcnt         polintr          pplhlp     
##  Min.   :3.150   Min.   :2.814   Min.   :2.158   Min.   :2.998  
##  1st Qu.:5.023   1st Qu.:4.395   1st Qu.:2.451   1st Qu.:4.228  
##  Median :5.571   Median :4.788   Median :2.673   Median :4.760  
##  Mean   :5.462   Mean   :4.836   Mean   :2.644   Mean   :4.859  
##  3rd Qu.:6.007   3rd Qu.:5.395   3rd Qu.:2.790   3rd Qu.:5.660  
##  Max.   :7.322   Max.   :6.617   Max.   :3.148   Max.   :6.984  
##      stfeco          stfedu         stfgov2         stfhlth     
##  Min.   :1.368   Min.   :3.126   Min.   :1.553   Min.   :2.482  
##  1st Qu.:3.476   1st Qu.:4.917   1st Qu.:3.528   1st Qu.:4.157  
##  Median :4.452   Median :5.520   Median :4.251   Median :5.397  
##  Mean   :4.484   Mean   :5.581   Mean   :4.217   Mean   :5.286  
##  3rd Qu.:5.367   3rd Qu.:6.267   3rd Qu.:4.956   3rd Qu.:6.302  
##  Max.   :7.937   Max.   :8.029   Max.   :6.580   Max.   :7.667  
##      trstep         trstlgl         trstplc          trstun     
##  Min.   :2.587   Min.   :1.898   Min.   :2.035   Min.   :2.596  
##  1st Qu.:4.039   1st Qu.:4.189   1st Qu.:5.104   1st Qu.:4.711  
##  Median :4.579   Median :5.076   Median :6.062   Median :5.100  
##  Mean   :4.477   Mean   :5.062   Mean   :5.885   Mean   :5.159  
##  3rd Qu.:4.904   3rd Qu.:5.992   3rd Qu.:6.647   3rd Qu.:5.549  
##  Max.   :5.963   Max.   :7.595   Max.   :8.199   Max.   :6.859
#remove metadata fields and scale the rest
scaled_data = as.matrix(scale(myDataClean[,-which(colnames(myDataClean) %in% metadata_columns)]))
rownames(scaled_data) <- paste(myDataClean$cntry_name, myDataClean$ess_year)

PCA

Experiment reducing dimensionality with Primary Component Analysis

# PCA
pc <- prcomp(scaled_data)
PoV <- pc$sdev^2/sum(pc$sdev^2) # proportion of variance
CumPoV <- cumsum(PoV)
n_CumPoV_085 <- length(which(CumPoV <=0.85))
print(paste(n_CumPoV_085, "principal components explain 85% of variance"))
## [1] "5 principal components explain 85% of variance"
p_var <- ggplot(data=data.frame(PoV=PoV, CumPoV=CumPoV, n=1:length(CumPoV)), aes(x=n)) +
  geom_bar(aes(y=PoV), fill="red", stat="identity") +
  geom_line(aes(y=CumPoV), color="black") +
  geom_hline(yintercept = 0.85, linetype="dashed") +
  geom_vline(xintercept = n_CumPoV_085, linetype="dashed") +
  labs(title="Principal Componen Analysis - Proportion of Variance",
       subtitle="Individual and cumulative variance explained (using 'prcomp')") + 
  xlab("Principal Component")

MyPrintInteractive(p_var, plot_interactive) 
knitr::kable(pc$rotation, digits = 2, caption = "Primary componets vs indicators")
Primary componets vs indicators
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
eisced2 -0.04 -0.02 0.28 -0.78 0.14 -0.30 0.01 -0.15 0.08 -0.21 0.10 0.03 -0.26 -0.07 0.10 -0.05 0.19 0.07
euftf2 0.12 -0.40 -0.34 -0.01 0.52 0.09 0.12 -0.24 -0.04 0.26 0.34 -0.14 -0.12 0.06 -0.24 0.15 0.25 -0.08
happy -0.29 0.10 0.06 0.22 0.17 -0.01 -0.09 -0.15 -0.10 0.25 0.08 -0.32 -0.11 -0.37 0.39 -0.54 0.14 0.08
health 0.20 -0.19 -0.03 -0.28 -0.48 0.51 -0.35 -0.23 0.05 0.13 0.21 -0.14 -0.05 -0.15 -0.09 -0.15 -0.19 -0.08
imbgeco -0.23 -0.34 0.11 0.07 0.06 0.01 -0.24 0.30 0.46 -0.24 0.34 0.05 0.45 0.08 0.00 -0.19 0.19 0.04
impcntr 0.17 0.44 -0.18 -0.10 0.00 0.05 0.15 -0.16 0.66 0.28 -0.10 -0.16 0.10 0.22 0.01 -0.07 0.10 0.23
imueclt -0.23 -0.36 0.06 0.14 -0.23 0.01 -0.07 -0.23 0.29 0.11 -0.41 0.04 -0.18 -0.07 0.28 0.38 0.34 -0.18
imwbcnt -0.24 -0.38 0.10 0.02 0.06 -0.26 0.07 -0.04 0.23 0.20 -0.21 -0.06 -0.21 0.12 -0.28 -0.23 -0.61 0.12
polintr 0.22 0.10 -0.29 0.15 -0.19 -0.57 -0.42 0.20 0.14 0.00 0.19 -0.15 -0.32 -0.08 -0.10 0.03 0.07 -0.21
pplhlp -0.28 0.12 0.23 -0.10 -0.16 -0.11 0.19 0.08 -0.05 0.13 0.13 -0.63 0.27 -0.18 -0.27 0.38 -0.01 -0.10
stfeco -0.29 0.08 -0.09 -0.18 0.11 0.15 -0.26 0.33 -0.09 0.28 0.19 -0.05 -0.17 0.43 0.44 0.29 -0.22 0.05
stfedu -0.27 0.15 -0.16 -0.13 -0.05 -0.30 -0.19 -0.38 -0.11 0.35 0.10 0.45 0.44 -0.06 -0.05 0.02 -0.09 -0.20
stfgov2 -0.26 0.13 -0.19 -0.21 0.17 0.23 -0.27 0.38 0.00 0.14 -0.38 0.13 -0.10 -0.30 -0.43 -0.03 0.21 0.12
stfhlth -0.24 0.22 0.04 0.17 0.18 0.09 -0.45 -0.47 -0.05 -0.43 -0.08 -0.21 -0.06 0.32 -0.23 0.07 -0.01 0.07
trstep -0.13 -0.11 -0.68 -0.19 0.02 -0.04 0.12 -0.04 0.04 -0.41 -0.14 -0.20 0.18 -0.24 0.26 0.06 -0.26 0.04
trstlgl -0.29 0.22 -0.05 -0.02 0.02 0.21 0.29 0.05 0.20 -0.14 0.12 0.08 -0.25 0.08 -0.08 -0.20 -0.05 -0.73
trstplc -0.29 0.13 -0.06 0.20 -0.19 0.07 0.19 -0.09 0.15 -0.10 0.45 0.30 -0.32 -0.28 -0.10 0.24 -0.05 0.44
trstun -0.25 -0.12 -0.27 -0.08 -0.47 -0.10 0.21 0.02 -0.29 0.02 0.00 -0.06 -0.05 0.45 -0.14 -0.30 0.37 0.17
# plot two first PCs
library(ggfortify)
# autoplot(pc)

# Take the principal components needed for 85%
comp <- data.frame(pc$x[,1:n_CumPoV_085])

# Plot cross correlations
ggpairs(data=comp) +
  labs(title="Correlations between Primary Components",
       subtitle="Zero correlations - as expected")

K-means clustering

Identifying potential clusters in country / ess year indicator rows using the PCA analysis results.

# Use the data in PCA space for k-means clustering

kmeans_data <- comp

# fix randomization
set.seed(123)

# find clusters, set max value to be count of countries in the dataset
k.max <- length(unique(myDataClean$cntry_name))
print(paste("Set upper bound to clusters the count of countries in the dataset", k.max))
## [1] "Set upper bound to clusters the count of countries in the dataset 28"
# Elbow Method for visually finding the optimal number of clusters
wss <- sapply(1:k.max, 
              function(k){
                kmeans(kmeans_data, k, nstart=50,iter.max = 15)$tot.withinss
              })
ggplot(data = data.frame(n=1:k.max, wss=wss), aes(x=n, y=wss)) +
  geom_line() +
  geom_point() +
  labs(title="Elbow method",
       substitle="Visually identify an 'elbow' in the plot") +
  xlab("Number of clusters K") +
  ylab("Total within-clusters sum of squares")

# NbClust - vote on kmeans ==> use this as the outcome
library(NbClust)
nb <- NbClust(kmeans_data, 
              diss=NULL, 
              distance = "euclidean", 
              min.nc=2, 
              max.nc=k.max, 
              method = "kmeans", 
              index = "all", 
              alphaBeale = 0.1)

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 3 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 3 proposed 22 as the best number of clusters 
## * 1 proposed 25 as the best number of clusters 
## * 3 proposed 26 as the best number of clusters 
## * 1 proposed 28 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
print(sort((nb$Best.nc[1,])))
##     Hubert     Dindex       Frey         CH Silhouette       Duda 
##          0          0          1          2          2          2 
##   PseudoT2      Beale    McClain      Scott    Marriot     TrCovW 
##          2          2          2          3          3          3 
##     TraceW       Ball   Hartigan  Ratkowsky PtBiserial    SDindex 
##          3          3          7          7          7          8 
##         KL   Friedman       Dunn         DB        CCC      Rubin 
##         22         22         22         25         26         26 
##       SDbw     Cindex 
##         26         28
ggplot(data=data.frame(n=nb$Best.nc[1,]), aes(n)) +
  geom_histogram(binwidth = 1) +
  labs(title="cluster count using several methods",
       subtitle="As majority vote determined by 'NbClust' function")

n_clusters <- length(unique(nb$Best.partition))
print(paste("Majority vote decision:", n_clusters, "clusters"))
## [1] "Majority vote decision: 2 clusters"

Visual identification is always somewhat vague. Fortunately there is an algorithm that can do it independently - even if one can always havwe a debate on whether the selection the algorithm made was right. This looks good enough for this purpose - at least for now.

Visualize groups in PCA space

Annoyingly, the two plots do not use the same colours for clusters.

# wrap results into a dataset
ds_clusters <- data.frame(cluster = as.factor(nb$Best.partition),
                          cntry_name = myDataClean$cntry_name,
                          ess_year = myDataClean$ess_year) 

# show in PCA space

library(cluster)
pam_obj <- pam(scaled_data, n_clusters)
autoplot(pam_obj, frame.type = 'norm', label=TRUE, label.size=2.5, label.alpha=0.6) +
  labs(title="Countries classified by ESS survey",
       subtitle="Each classification identified")

autoplot(prcomp(scaled_data), data=ds_clusters, colour='cluster', 
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, 
         loadings.label.size = 3) +
  labs(title="Countries classified by ESS survey",
       subtitle="Eigenvectors of original indicators shown")

Countries split to classes

The Nordic countries are in the same cluster, which is not that surprising given their relatively common values and ways in their respective societies. Other participants in that cluster are perhaps not as clear or obvious - at least to me.

And then there is the rest of Europe in the other cluster.

(I did at some point of time the same clustering excercise using scaled indicators and came up with the same clusters as with PCA)

# classify countries based on the clusters they have most often been based on survey scores

cntry_clusters <- ds_clusters %>% 
  group_by(cntry_name, cluster) %>% 
  summarise(n=n()) %>% 
  arrange(desc(n)) %>%
  filter(!duplicated(cntry_name)) %>%
  arrange(cluster)

p <- CreateEuroMap(dsin = cntry_clusters %>% ungroup(), 
                   ind_name = "cluster",
                   txt_title = "Countries clustered by indicators",
                   txt_subtitle = "Based on the cluster a country most often has been in",
                   txt_caption = "ESS surveys 2002-2016")
MyPrintInteractive(p, plot_interactive) 
# how does clustering changed annually

clustered_years <- ds_clusters %>% 
  group_by(ess_year, cluster) %>% 
  summarise(n=n())

p <- ggplot(data = clustered_years, aes(x=ess_year, y=n, fill=cluster)) +
  geom_bar_interactive(stat = "identity", 
                       position = "fill",
                       aes(tooltip=paste0(ess_year, ' - cluster ', cluster, '\nn=', n))) +
  geom_hline(yintercept = c(0.25, 0.5, 0.75), linetype="dashed") + 
  labs(title="Relative size of groups",
       subtitle = "As classified by country/survey granularity",
       txt_caption = "ESS surveys 2002-2016") +
  xlab("") + ylab("")
MyPrintInteractive(p, plot_interactive) 

Looking into details of how do the two grups differ we can see that in most of the indicators countries in cluster 2 give higher scores which roughly translate to “more” or “better” — with a couple of noteworthy exceptions

  • health – low value is good and high value is poor health
  • impcntrl – low value means more immigration and high less immigration from poor countriesThere seems
# compare the indicators between classes

p <- ggplot(data=ds_ss_avg %>% 
              filter(indicator %in% (indicator_table %>% filter(source=="ESS"))$indicator) %>%
              group_by(indicator) %>%
              mutate(value_s = scale(value)) %>%
              ungroup() %>%
              left_join(cntry_clusters %>%  
                          select(cntry_name, cluster), 
                        by="cntry_name"),
            aes(x=indicator, y=value_s, fill=cluster)) +
  geom_boxplot_interactive(aes(tooltip = paste(indicator,'\nclust ', cluster))) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5)) +
  xlab("") + ylab("") +
  labs(title="Differences between groups of countries",
       subtitle="Variables scaled to same range")
MyPrintInteractive(p, plot_interactive) 

Using the same classification on selected World Bank and UNHCR seems to be supportive of the grouping even though no formal clustering based on these values has been conducted - at least not yet.

# World Bank
p <- ggplot(data=ds_ss_avg %>% 
         filter(indicator %in% (indicator_table %>%
                                  filter(indicator %in% pct_indicators,
                                         source=="WB"))$indicator) %>%
         left_join(cntry_clusters %>%  
                     select(cntry_name, cluster), 
                   by="cntry_name"),
       aes(x=indicator, y=value, fill=cluster)) +
  geom_boxplot_interactive(aes(tooltip = paste(indicator,'\nclust ', cluster))) + 
  # geom_boxplot_interactive() + 
  xlab("") + ylab("") +
  labs(title="Differences between groups of countries",
       subtitle="WB Variables", 
       caption = "World Bank") +
  coord_flip()
MyPrintInteractive(p, plot_interactive) 
# UNHCR
p <- ggplot(data=ds_ss_avg %>% 
         filter(indicator %in% (indicator_table %>% 
                                  filter(indicator %in% pct_indicators,
                                         source=="UNHCR"))$indicator,
                !is.na(value)) %>%
         left_join(cntry_clusters %>%  
                     select(cntry_name, cluster), 
                   by="cntry_name"),
       aes(x=indicator, y=value+1, fill=cluster)) +
  geom_boxplot_interactive(aes(tooltip = paste(indicator,'\nclust ', cluster))) + 
#  geom_boxplot_interactive() + 
  scale_y_log10() +
  xlab("") + ylab("") +
  labs(title="Differences between groups of countries",
       subtitle="UNHCR Variables on logarithmic scale (offset +1%)",
       caption="UNHCR") +
  coord_flip()
MyPrintInteractive(p, plot_interactive) 

Visual box plot comparisons of selected UNHCR and WB indicators seem to support the clustering clustering based on ESS data.

  • Cluster 2 countries appear more urbanized than cluster 1
  • Proportion of population born in another country is higher in cluster 2 even though there is quite some overlap in distributions
  • There is higher percentage of people active in workforce in cluster 2
  • Government health and education expenditures vs. GDP are higher in cluster 2
  • GDP growth (and per capita growth) is higher in cluster 1 countries but there is significant overlap with cluster 2 distribution
  • Significantly higher share of population is using Internet in cluster 2 compared to cluster 1
  • Tax revenues as % of GDP are higher in cluster 2
  • Personal remittances received (% of GDP) is higher in cluster 1

Summary

ESS data is an interesting source for getting insights into the recent history and ongoing present of Europe from data.

In this post perhaps the most interesting – but perhaps not surprising – findings were

Linking between ESS and other data may give further insights and I will probably return to this topic later.


References

A real serious study on the topic (I’ve yet to read myself):

ESS

Citation of data:

  • European Social Survey Cumulative File, ESS 1-7 (2016). Data file edition 1.0. NSD - Norwegian Centre for Research Data, Norway - Data Archive and distributor of ESS data for ESS REIC.
  • European Social Survey Round 8 Data (2016). Data file edition 2.0. NSD - Norwegian Centre for Research Data, Norway - Data Archive and distributor of ESS data for ESS ERIC

Citation of documentation:

  • European Social Survey (2016). ESS 1-7, European Social Survey Cumulative File, Study Description. Bergen: NSD - Norwegian Centre for Research Data for ESS ERIC.
  • European Social Survey (2016): ESS8- 2016 Documentation Report. Edition 2.0. Bergen, European Social Survey Data Archive, NSD - Norwegian Centre for Research Data for ESS ERIC

Distributor of Data

NSD - Norwegian Centre for Research Data,
Harald Hårfagresgt. 29 N-5007 Bergen, Norway.
Phone: +47 55 58 21 17
Fax: +47 44 58 96 50
e-mail: nsd@nsd.no
Web: http://www.nsd.no/english

ESS: essdata@nsd.no
ESS: www.europeansocialsurvey.org

World Bank

World Development Indicators https://data.worldbank.org/products/wdi

The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

Excel file was downloaded and used

UNHCR

Population Statistics / Time Series http://popstats.unhcr.org/en/time_series

On this page, each row of data represents the information about UNHCR’s populations of concern for a given year and country of residence and/or origin. Data is presented as a yearly time series across the page. In the 2017 data, figures between 1 and 4 have been replaced with an asterisk (*). These represent situations where the figures are being kept confidential to protect the anonymity of individuals. Such figures are not included in any totals”

Csv extract of all data was used.

Technical notes

Caption and figure numbering is produced as described by Norbert Köhler in his blog post “R Markdown: How to number and reference tables”

Top matter definitions

Change Notes

2018-10-18 - started to update the look and feel

Partly done in the early sections of the report, hopefully will find time a bit later to revise the rest as well.